Learning what we know and knowing what we learn: Gaussian process priors for neural data analysis
Date Posted:
July 12, 2021
Date Recorded:
July 8, 2021
Speaker(s):
Guillaume Hennequin, Kris Jensen
All Captioned Videos Computational Tutorials
Description:
Guillaume Hennequin, Kris Jensen - University of Cambridge
Colab notebooks:
Additional papers and resources
PRESENTER: Today, we have a talk called "Learning what we know and knowing what we learn: Gaussian process priors for neural data analysis." We have two speakers. The first is Dr. Guillaume Hennequin, who is an associate professor in the Computational and Biological Learning Lab at the University of Cambridge, where he uses probabilistic machine learning and control theory to analyze and model neural activity in rodents and non-human primates.
Kris Jensen is a PhD student and Gates Cambridge scholar in the Hennequin lab and develops probabilistic latent variable models to better understand how behavior arises from the collective activity of large populations of neurons. And so I am very excited to hear them talk about Gaussian process priors today. So I'll hand it over to Kris.
KRIS JENSEN: Hello, everyone. Yeah, thank you very much for the kind introduction. And as mentioned, my name is Kris, and I'm a PhD student with Guillaume. And today, we'll be talking a little bit about probabilistic latent variable models, and in particular, those models that use Gaussian processes as a prior distribution for their latent variables. And so we have this, maybe, slightly cryptic title, learning what we know and knowing what we learn, which will hopefully make more sense by the end of the talk.
But just to unpack it a little bit, the learning what we know bit is, really, this idea that by incorporating the prior knowledge that we have about natural systems and behavior as priors in these latent variable models, that can help us learn better representations that, in turn, can inform our analyses of neural activity and behavior. And knowing what we learn-- what we mean by that is that we really think that by understanding these assumptions and this prior knowledge that goes into the latent variable models, this can maybe help us also understand and interpret the latent trajectories and these representations that come out the other end.
And so upfront, I also just want to-- or we just want to say thanks to a bunch of other people who've been involved in much of this work-- so these are various lab members, past and present-- and in particular, to Ginny Rutten, who developed a method known as GPFADS, which we'll talk about a little bit towards the end, and Calvin Kao, who is a recent PhD student from the lab who has been involved in a lot of this work as well.
And just for kind of a quick overview, I'll start by giving an introduction to a range of probabilistic latent variable models in neuroscience. And this is really going to be a very quick overview, just as a bit of a teaser. And then, Guillaume will talk more specifically about Gaussian processes before we'll put these ideas together and dive deeper into the use of Gaussian processes as priors in these latent variable models. And this will also involve two example Colab notebooks where there will be links to these in the chat. And you can either follow along in those, or otherwise, I will kind of be going through everything on the screen as well. If you just want to follow along there and look at the notebooks later, that's also perfectly fine.
OK, so with that, let's get started. And the first question we might be wondering about is, why should we bother about these latent variable models? And here's just a very simple example of where we've used them. And this is for a non-human primate task where you have a monkey, and the monkey is, maybe, reaching towards a point in a screen. So this is a continuous reaching task where the monkey reaches towards a target, and then a new target lights up, and the monkey reaches towards that target, and so on, and so forth.
And you might imagine that you're recording from a bunch of neurons as the monkey's performing this task. And we might record from several hundred neurons, or with Neuropixels, even 1,000 neurons. And it's maybe not immediately obvious what this neural activity actually means or how it relates to the behavior. And in particular, the activity of each individual neuron tends to be very noisy and can maybe be quite difficult to relate to anything, either in the external world or latent decision-type processes happening within the brain.
And so our first step to analyzing such data is often to do some sort of dimensionality reduction or fitting some kind of latent variable model to extract a lower dimensional, and hopefully, more meaningful and easier to interpret signal. And so obviously, this is a topic that's been studied a lot in neuroscience. And so what are the things that people tend to use? Well, here's, perhaps, the most commonly used model, which is just principal component analysis.
And here's just an example from Sohn et al., where they basically use it to analyze this task where a monkey has to measure out a particular amount of time and then wait that amount of time before responding. And these different colors indicate different time delays that the subject has to measure out. And you can kind of see that when the monkey has to measure out a short amount of time, you get these fast neural trajectories in red. And when it has to measure out a long amount of time, then you get these rather slow neural trajectories in blue.
And this all looks very nice and smooth and everything, but this also relies on having a large set of trials, which allows us to really trial average this neural activity, because principal component analysis, otherwise, can be quite sensitive to trial-to-trial noise. And so to try to address this, Byron Yu et al. developed Gaussian process factor analysis in 2009, which is commonly used to analyze single trial data. And this really allows for information sharing across time, which can help de-noise these single trial latent trajectories, which is what's plotted in gray here.
And an assumption of both principal component analysis and Gaussian process factor analysis is that there's a linear relationship between the latent variables and the observations of neural activity, which in some cases, might not be appropriate, for example, in the case of place cells, which were analyzed by Anqi Wu in Jonathan Pillow's lab a couple of years ago. And so instead, they fitted a Gaussian process latent variable model, which instead of using the Gaussian process as a prior over the latent trajectories, actually uses it as a non-linear observation model. And this allows them to fit these kind of nonlinear, behaviorally relevant tuning curves, which in this case, corresponds roughly to place fields.
Another kind of approach to these latent variable models relies on the use of-- or the rise of deep learning. And this is, for example, things like PfLDS, which combines a linear dynamical system, as, for example, used in the classical Kalman filter, with a readout that is basically a deep network, which you can learn by fitting it to a large quantity of neural data. And in this case, it's used to recover a latent trajectory corresponding to the phase of a drifting grating. And it does so very well indeed.
Another common-- well, not common, but another well-known method that has recently risen to prominence is LFADS, which many might have heard of as well. And similarly to PfLDS, this builds in this idea that you can model the brain as something resembling a dynamical system, in this case, a non-linear dynamical system, with a linear readout. So this is kind of the reverse of PfLDS, which is a linear dynamical system with a nonlinear readout.
And this has, for example, been used a lot in the modern neuroscience literature, where it's been really, really powerful and very good at extracting neural activity or latent trajectories that can recapitulate motor output very well. And so this really relies on this knowledge that we have from previous studies that motor cortex and other motor regions can be quite well described as dynamical systems. And so basically built this knowledge into their latent variable model to provide a very powerful method for extracting information from the recorded neural activity.
And so you can kind of see these first three methods that I mentioned as non-parametric methods, which don't make an explicit assumption about the relationship between the different latent states over time, apart from the fact that in the case of this Gaussian process factor analysis and Gaussian process LVMs, there's some kind of smoothness constraint. Whereas, these two subsequent methods, PfLDS and LFADS are dynamical systems methods or parametric methods that really learn a parametric dynamical system that predicts how this latent trajectory is going to evolve in a-- not deterministic manner, but at least a predictable manner.
And so recent work by Ginny Rutten in Guillaume's lab has gone into trying to bridge this gap by developing a method known as GPFADS, or Gaussian Process Factor Analysis with Dynamical Systems. And what this is trying to do is really use the non-parametric nature of the Gaussian process as a prior, but combine it with this idea that these latent trajectories might share certain features with dynamical systems, such as non-reversibility.
And so here's an example where they basically extracted two latent planes when applying this method to a monkey motor cortex and found that, in particular, one aspect of this neural data really showed signs of such non-reversibility. And this is something that Guillaume will talk a lot more about later on. And so clearly there are a lot of different methods, and these make a lot of different assumptions and contain a lot of different model components. And so one question might be, OK, so we want to fit one of these latent variable models.
Clearly, there's a really large toolbox of things we can fit. And we could even come up with our own. So where do we even begin? How do we even begin to think about these models in a fairly unified framework? And the answer, at least for us, is to use Bayesian inference and to think about this in a probabilistic setting. And this is really how a lot of these models are phrased-- are framed. And this helps us both understand existing models and also, potentially, develop new models that might better-- fit better with our understanding of a particular task or region of the brain and build in this domain knowledge.
And so a quick overview of how we go about fitting or motivating these probabilistic latent variable models, the first thing we can really do is that we record some data. So we have our data, which we'll call Y, and this is, really, what we want to explain somehow. And the way we want to explain the data is by inferring latent variables. So this is what we call X. This is this de-noised, lower-dimensional representation that we hope can tell us something about the brain or behavior.
And in order to extract this information, what we first need to do is we need to define a prior. And this is what we'll spend a lot of time talking about today. And what this prior is saying is, basically, what do we think is going to be true about these latent variables before we even see any data? And this, here-- so this is where we can incorporate, for example, Gaussian processes, which allows us to build in this intuition that there should be some kind of information sharing across time, not just across neurons.
We also need to define an observation model, which in many cases, such as GPFA, or factor analysis, is just a linear readout model. But it could also be more complicated, such as a full deep neural network, as in PfLDS. And both the prior and this observation model may have additional parameters that we need to learn. The way we learn these parameters is by training the model.
And the most common way to do training is basically by coming up with some kind of maximum likelihood estimate of the parameters for both our prior and observation model. And what this means is, basically, we say that the best parameters we can find are the ones that maximize the probability of our data, given those parameters. And this quantity, p of Y here, is basically the integral over p of Y given X times p of X. And it's known as the marginal likelihood, and this is also something that Guillaume will be talking a lot more about later on.
And once we've trained a model, we can finally do inference. And inference is really the end goal. That was extracting this latent variable X, so this latent trajectory X. And the way we do that is basically by applying Bayes' theorem, which says that the probability distribution over X, given Y and our parameters here, is proportional to p of Y given X times our prior over X. And this is really where we allow our prior to constrain what we get out at the other end.
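To make those two steps a bit more concrete, here is a compact way of writing them down, using a generic parameter vector theta for the prior and observation-model parameters (this notation is ours, added for clarity, rather than taken from the slides):

\theta^\star = \arg\max_\theta \, p(Y \mid \theta), \qquad p(Y \mid \theta) = \int p(Y \mid X, \theta)\, p(X \mid \theta)\, dX \quad \text{(marginal likelihood, maximized during training)}

p(X \mid Y, \theta^\star) \;\propto\; p(Y \mid X, \theta^\star)\, p(X \mid \theta^\star) \quad \text{(posterior over latents, computed at inference time)}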
And so this is really quite a powerful framework, and it's very flexible. And it allows us to both build in this prior domain knowledge and deal with uncertainty. So we get a whole distribution over our latent trajectories. But it also has disadvantages, notably, that this can be intractable. In particular, the training step and the inference step can be intractable for complicated models, and we might have to resort to some kind of approximate inference, which is also something we'll talk about later.
And so this was really meant as a very brief overview of how do we think about these things, how do other people think about these things, what are the existing models, and where do we start from. And now, we're going to take a bit of that step back and think more precisely about this prior distribution here and how we can use Gaussian processes to model time series data or other data that might be evolving smoothly over some domain.
AUDIENCE: I have a quick question.
KRIS JENSEN: OK.
AUDIENCE: So out of all the-- so this is from some of your introductory slides. Out of all the methods that you specify which are used for data analysis, only PCA is the one that requires a large number of trials?
KRIS JENSEN: No. I mean, so you could argue that they all require a large number of trials. But by building in prior knowledge, you can make the method less data hungry, in a sense. Because if you, for example, build in this prior assumption that things are continuous in time, then for a given time step, both the data from the time step before and the data from the time step after is going to help inform what the latent state is at your time step, t.
And so in that sense, you can get more confidence in your latent states and less noisy latent states with a small amount of data. Whereas PCA, because it doesn't make any of these assumptions, it-- to get the same level of de-noising, in a sense, you need more data or trial averaging of some sort.
But you also have some of these other methods, for example, PfLDS or LFADS, still have a lot of parameters. And so to learn all of these parameters, you're still going to need a large amount of data. And one of the reasons we like Gaussian-process-based methods is that it allows us to use data very efficiently, and therefore, is well-suited for single trial analyses or even analyses of data without any trial structure.
AUDIENCE: OK. That makes sense. Thank you.
GUILLAUME HENNEQUIN: OK. So I'm going to give you a brief primer on Gaussian processes. And I'm really aiming at a very conceptual introduction, almost equation-free, actually. So this is really aimed at people who've never heard of Gaussian processes, or perhaps have heard maybe once or twice about Gaussian processes and need a refresher. For those of you who are very familiar with GPs already, feel free to go and have coffee. And that should be about 20 minutes long, I think.
OK, so Gaussian processes-- so even though my introduction is going to be conceptual, I just want to point out that there's this book, do a bit of advertisement for Carl Rasmussen at CBL, here in Cambridge. He wrote this beautiful book called Gaussian Processes for Machine Learning, and this is where you should go, really, if you want to know more about the details.
OK, so there are various ways of introducing Gaussian processes, and I'm going to take the simple way, which is to start with a simple regression problem. So say you've measured something interesting here on the y-axis over some time interval, and in particular, at a few discrete time points. So I'm actually showing you some of these measurements. And what I'm asking us to do collectively is some sort of mental curve fitting exercise where you have to think about, OK, what do you think is the function of time that underlies these observations?
So could you start thinking in your head how you would join these dots by using some line. And I'm pretty sure that nobody in their right mind here would actually think of having a curve that starts at that point here on the left and shoots up by a million on the y-axis before going back down to hone in on the second point. So you all have some prior assumption about what the curve might look like, something that looks reasonable.
And so the answer clearly depends on prior beliefs. One particular prior belief that-- so for example, something that I need to highlight upfront is, maybe, you might even believe that your data, your measurements are very noisy. If your measurements are noisy, I think everyone sensible will actually just draw a flat line and think that this is what underlies these observations. Because if you squint a bit or move back from your screen, all you see is just a bunch of dots on a roughly horizontal axis.
So it depends on prior beliefs. And one common prior belief that we have about latent processes that might be unfolding in the brain-- because everything that then flows through a particular dynamical system that itself drives the body into behavior-- we believe that they are somehow smooth in time. So this is a tentative answer at joining these dots if you believe the function is smooth.
But obviously, it's not-- there's more than one way to smoothly interpolate between these points. So you could do this. You could do that. You could do that. You could do that. You could do that. And if there is anything you can take out of this tutorial at the end, it is how to actually make these illustrations. How do you draw lines that all look smooth and pass by these points? That's kind of GPs in a nutshell.
OK, so because there is more than one way to smoothly interpolate between these points, one thing that would be really nice to have is somehow model the uncertainty that we have about what on Earth is the function doing between the points where we have no observations. So those two things that I've written in bold here, prior beliefs and modeling uncertainty, they are very naturally approached in a probabilistic framework, using the language of probability theory.
And this is what Gaussian processes do for you. So they are introducing a particular type of probability distribution over functions. And as soon as you can manipulate a probability distribution over functions, then you're in a position to use Bayes' theorem, as Kris mentioned earlier, to actually start modeling what you do not know about that function given what you observe.
So there's stuff that you know. You observe the function at this point. But there's a lot of stuff that you don't actually know, and that's what's in between. And it is particularly important in machine learning and in other fields that use Gaussian processes to-- when it comes to, for example, making decisions based on what you believe is true between the points, it's actually very important to model your uncertainty, to know what you don't know.
So this is kind of the end goal of what we call Gaussian process regression. It's to essentially trace a curve that represents the most likely underlying function that underlies these observations, but also to put some error bars around that most likely function. And the question is, well, how do we go about making up these error bars and finding this mean function. So that's what I'm going to be explaining next.
So Kris already mentioned this thing called the standard Bayesian recipe, or Bayesian inference. And conceptually, it's really super simple. It just fits in these two lines. You introduce a prior over the function, p of f-- and I'm very loose in terms of notation here, I'm really speaking very conceptually-- and an observation model, which is going to give you a density of observations given the true underlying function.
So here, for example, I've chosen an observation model that is almost noise-free. I'm kind of assuming in this graph that whenever I observe a measurement, for example, something slightly negative at 0.65, I know for sure that the function was exactly that value. Because I am assuming that there is very, very low measurement noise. But you could incorporate arbitrary noise models in there too.
And Kris will also showcase one application of that where you can even use likelihood functions, observation models that are appropriate for neuroscience data, such as spiking data. So once you've got yourselves these two ingredients-- so these are modeling choices that you have to make. And I'll get back to that. And there might be some parameters involved in these models that you need to tweak.
Then fitting, this mental curve-fitting exercise that we just did, just becomes an inference problem where you just form the posterior distribution over functions, given your observations. And so I'm showing here the posterior mean and the posterior variance around that. And that's basically just given by the product of these two. There's only one way to mathematically correctly combine the prior and the observation model, and it's just to multiply these two densities together, these two functions together.
OK, so I'm now going to tell you what these Gaussian process priors really are. So how do we construct the prior over functions that incorporates intuitive assumptions about the latent function, for example, smoothness. OK, so I think the best way to explain what we mean by a prior over function is to actually show you samples from a particular prior.
So say, for example, we believe that, before even observing any data-- so the prior-- we believe that the function-- the underlying-- the function that's going to underlie the data we're going to be seeing at the moment is smooth. And say, on top of that, that we expect that it's roughly 0 on average and that it doesn't stray too far on the y-axis, staying within a minus 3 to 3 range, or something.
You've got some idea of the scale of that function, and you've got some idea of the smoothness. And here, by smoothness, I mean that the value of the function is not expected to change dramatically on some particular time scale. So for example, here, between 0.4 and 0.5, it doesn't change very much. So here, are our particular draws that are compatible with these assumptions that I've just spelled out in plain English. So these are draws from a Gaussian process, and I'll explain how I actually drew them.
So the way we model smoothness is by considering the covariance. So how do we express the fact that the function doesn't change too much on a particular time scale? So say you take two neighboring time points here, t and t prime. And you can see here that across draws, across trials, the values of the function at these two time points covary very strongly.
So the two orange dots are both high and the two green dots are both low, et cetera. So they will covary very much from trial to trial. And then as you move away, if you consider a t prime that's much later, several characteristic timescales away, then you can see that this correlation is much weaker. There's no systematic relationship between the dots of the same color when t prime is far away from t. So you can actually cook up a particular covariance function which expresses just this.
You can just manually say, OK, I want the covariance, I want the similarity between the function values at any pair of t and t prime to be given by this. So this is a covariance function. And this is the one that I used to actually draw from-- to draw these samples. So this is the so-called squared exponential covariance function. It's the most often used covariance in Gaussian processes. And it's a simple banded matrix. We see that correlations just fade away as t and t prime grow apart.
OK, I'm not planning to actually show you equations for how to manipulate these matrices and how to draw samples. But all I'm going to tell you is that, first of all, it's in the book. Second of all, it's really not that complicated. So do not shy away from giving it a try. At the end of the day, it's just-- I've only used two operations-- matrix multiplication and an SVD. You can also use a Cholesky decomposition, but an SVD does as well. And these are two very simple operations that you can do in MATLAB, Python, or whatever.
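As a concrete illustration of that recipe, here is a minimal Python sketch of how such samples can be drawn, using a Cholesky factorization in place of the SVD Guillaume mentions; the time grid and kernel parameters are just illustrative choices, not the ones from the slides:

import numpy as np

def se_kernel(t, lengthscale=1.0, variance=1.0):
    # squared exponential covariance: k(t, t') = variance * exp(-(t - t')^2 / (2 * lengthscale^2))
    diffs = t[:, None] - t[None, :]
    return variance * np.exp(-0.5 * (diffs / lengthscale) ** 2)

t = np.linspace(0, 10, 200)                          # time grid
K = se_kernel(t, lengthscale=1.0)                    # prior covariance over function values
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(t)))    # small jitter for numerical stability
samples = L @ np.random.randn(len(t), 5)             # five draws from the zero-mean GP prior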
OK, so we see-- so the point of this slide is that we can express our assumptions about, for example, smoothness of the latent function in terms of a covariance. And Gaussian processes basically stop here. So instead of specifying higher-order moments, they just say, OK, I'm just going to specify the mean-- and in this case, it was 0-- and a particular covariance, and then I'm going to choose a probability distribution that is fully specified by these two things.
And if you search the stats literature for one such distribution, then the Gaussian distribution comes up. So a Gaussian process is essentially a process where the function value at any two points-- so you've got two function values evaluated at t and t prime-- is jointly Gaussian. And so it's just fully specified by the mean at these two points and the covariance at these two points. So that's it.
OK, so I'm going to walk you through-- just to give you a bit of intuition of what sort of assumptions you can encode in these covariances to show you that it's quite flexible, I'm going to walk you through just a couple of examples of common kernels. I just find it-- when I learned about GPs, I found it quite illuminating to think about these various covariance functions. So this is the squared exponential kernel, which I've already shown you.
This here, at the top, is the equation that describes it-- it's a stationary covariance function. So if you take a slice here at the dashed line in the middle of the matrix, this is just a slice that shows you how the covariance varies for fixed t prime as a function of t. So it's just a function of the difference between the two. And this is just a fading squared exponential curve, a bit like a Gaussian bump, actually, except that that's not what the "Gaussian" in Gaussian processes refers to. OK, and I'm showing you a couple of draws again.
So here, there is a characteristic length scale, l, that divides tau, the difference between t and t prime, in this expression. And I believe I've just set it to 1 here, relative to this time axis, from 0 to 10. So you can see that the function varies very smoothly, roughly on a time scale of 1 time unit.
OK, here's another example. So you can use this expression instead. It's a valid covariance function. And you can see here that it has a bit more of a tail, which means that there are going to be correlations on longer timescales-- and it's also a bit more peaked.
So we're going to have a bit faster wiggles, but also a bit more power in the power spectrum at very low frequencies. I'm just showing you a few draws. So this one is not particularly characterized by one specific length scale, but it's got a bit more than a single length scale in it.
OK, one that is good for modeling things that are oscillatory is this so-called spectral mixture kernel. And this is just the product of the squared exponential term with a cosine. And this is the associated covariance. And these are a couple of draws. So you can see that, roughly, the function is ringing at a particular frequency given by omega here. But apart from that, there is an envelope that varies according to a squared exponential kernel that is random.
So you can see how-- we already see that there's quite a bit of flexibility. And then, of course, you can go to a bit more crazy kernels-- for example, this one is the so-called Ornstein-Uhlenbeck process kernel, or a Laplacian kernel, and it just decays away as an exponential.
And the fact that there's a bit of a kink at 0 time lag here results in these non-differentiable processes. So the function is still continuous, but it's not differentiable. So this is not one that we would typically use in neuroscience. Because we're trying to smooth things over, and so if you use that, you end up with latent trajectories that are continuous, but not necessarily smooth. OK, a couple of draws.
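For reference, here is a rough sketch of the four kinds of covariance functions just described, written as functions of the time lag tau. The exact forms and parameterizations on the slides may differ slightly, and the second one is written here as a rational quadratic kernel, which is one standard choice with the heavier tail Guillaume describes; treat these as illustrative stand-ins.

import numpy as np

def squared_exponential(tau, l=1.0):
    # smooth draws with a single characteristic timescale l
    return np.exp(-0.5 * (tau / l) ** 2)

def rational_quadratic(tau, l=1.0, alpha=1.0):
    # heavier tail and sharper peak: behaves like a mixture of many timescales
    return (1.0 + tau ** 2 / (2.0 * alpha * l ** 2)) ** (-alpha)

def spectral_mixture(tau, l=1.0, omega=2.0 * np.pi):
    # squared exponential envelope times a cosine: quasi-oscillatory draws at frequency omega
    return np.exp(-0.5 * (tau / l) ** 2) * np.cos(omega * tau)

def exponential(tau, l=1.0):
    # Ornstein-Uhlenbeck / Laplacian kernel: continuous but non-differentiable draws
    return np.exp(-np.abs(tau) / l)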
OK, so this is all I wanted to say about the prior itself. And then I'm going to kind of walk you through how we actually use these things, and in particular, how do we-- so how do we use a Gaussian process prior in this particular simple regression context? And this is going to allow me to talk about what it means to actually train a GP model. So far, I've just told you about, OK, we can pick some parameters and draw from the prior. And we have a rough idea of how the parameters affect the samples or affect the prior distribution.
Now I'm going to actually tell you how to use it and how to adjust the parameters if need be. So say we approach a particular regression problem with these data points in the following ways. So we place a Gaussian process prior over the underlying function, f, with, say, the RBF, or squared exponential kernel-- that's just another name for the squared exponential kernel-- with some timescale, l.
Then I'm going to introduce an observation model, as I promised before. And as Kris mentioned earlier, I'm going to say that I believe that my observations are the results of adding some noise to the actual ground truth function. And that noise is going to be characterized by some variance sigma squared. It's just going to be Gaussian noise with some variance sigma squared.
Here, in this particular plot, I actually chose sigma squared equals 0 and length scale of 0.1. So all that's happening in this plot is that we've specified the parameters of both the likelihood function and the prior. We turn the Bayesian crank, and out comes this posterior distribution. And I'm just showing you the mean and-- flanked by plus or minus 2 standard deviations, so 95% confidence intervals on that function.
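Here is a minimal sketch of that "turning the Bayesian crank" step for this simple regression setting, following the standard GP regression equations (Rasmussen & Williams, chapter 2); the five data points below are made up for illustration, not the ones on the slide:

import numpy as np

def se_kernel(a, b, l=0.1):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / l ** 2)

def gp_posterior(t_obs, y_obs, t_test, l=0.1, noise_var=1e-6):
    # posterior mean and standard deviation of f at the test points
    K = se_kernel(t_obs, t_obs, l) + noise_var * np.eye(len(t_obs))
    Ks = se_kernel(t_obs, t_test, l)
    Kss = se_kernel(t_test, t_test, l)
    mean = Ks.T @ np.linalg.solve(K, y_obs)
    cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
    return mean, np.sqrt(np.clip(np.diag(cov), 0.0, None))

t_obs = np.array([0.1, 0.3, 0.45, 0.65, 0.9])        # hypothetical observation times
y_obs = np.array([0.4, -0.2, 0.1, -0.1, 0.5])        # hypothetical measurements
t_test = np.linspace(0, 1, 200)
mean, std = gp_posterior(t_obs, y_obs, t_test)       # plot mean +/- 2*std to get the shaded band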
OK, now you might kind of start to wonder, well, I mean, you picked l equals 0.1 and sigma equals 0, but what made you believe that these were reasonable choices? As I said earlier, if you believe that the measurements are noisy, for example, then presumably the answer would be very different.
If you believe that the function varies on much faster length scales or much shorter length scales, then the answer might also look different. So the question is, well, I mean, first of all, how does this fit vary with these parameters? And second, how do we go about tuning these parameters in the most sensible way?
So here's, for example, what happens if I keep the same length scale but I will allow some slack. I'm acknowledging the fact that there might be measurement noise, and therefore, I don't necessarily believe that the function passes exactly through these points. So you can see that, all of a sudden, there's a bit more of an error bar around these points.
And you see that it starts to affect the shape of your posterior. For example, the mean here is compromising a little bit. The red curve used to pass exactly through the third dot, but it's now overshooting a bit here. So there is a compromise between accommodating the length scale and the noise.
So here's one where I also acknowledge the fact that there might be noise, but I believe a priori the length scale might be much longer, the timescale might be 1, which means, essentially, that this function is roughly flat over that 0 to 1 time interval. And again, I get a completely different answer. So as I said, the question is, well, how do we pick these hyperparameters.
And again, this is where having a fully probabilistic formulation of the problem really helps you think about how to tweak your parameters. And essentially, all you have to do is to realize that these parameters affect the overall probability that your generative model, shown at the top, might have generated the data that you observe.
What therefore, makes the most sense to do is to just maximize that probability that the model itself, irrespective of the exact detail of the underlying function-- that the model itself could have generated the data. And by how the model generates the data, I literally mean that the model would first draw from the prior, a particular f, and then draw a particular noise around those values. And then the question is, well, does my data look like they could reasonably have come from that sampling process.
And the way to approach this is basically to marginalize out f, that is, to integrate how well a given function f explains the data over what you believe a priori that function to be. So this is called a marginal likelihood, and this is something that in the Bayesian literature, you will see that people often maximize to train their model. And this is exactly what people do in the Gaussian process literature. And this is something that also Kris will tell you about in the domain of GPFA.
OK, so here, just to give you a bit of intuition, I just made a simple heat map where I'm showing you the value of that particular function, called the marginal likelihood-- the log marginal likelihood, in this case-- as a function of the noise level and the timescale. And you can see there's a clear region that's clearly more likely than the other regions. And it's this region here with a fairly low noise level and a length scale at about 0.1.
And in fact, if you ask me how I generated these data points, well, actually, I picked sigma equals 0 and l equals 0.1. So you can see, we are not too far off. We're not exactly spot on, because of course, I only drew five data points. So that's not nearly enough to constrain the model parameters exactly and to get them to that ground truth, but it's pretty good. And of course, it's all consistent in the sense that, if you were to observe more data, then eventually you would recover the right model.
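A minimal sketch of how such a heat map can be computed for the squared exponential model with Gaussian observation noise; the data points are again hypothetical stand-ins for the ones on the slide:

import numpy as np

def log_marginal_likelihood(t, y, l, noise_var):
    # log N(y; 0, K + sigma^2 I) for a zero-mean GP with a squared exponential kernel
    K = np.exp(-0.5 * (t[:, None] - t[None, :]) ** 2 / l ** 2)
    C = K + noise_var * np.eye(len(t))
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * y @ np.linalg.solve(C, y) - 0.5 * logdet - 0.5 * len(t) * np.log(2 * np.pi)

t = np.array([0.1, 0.3, 0.45, 0.65, 0.9])            # hypothetical observation times
y = np.array([0.4, -0.2, 0.1, -0.1, 0.5])            # hypothetical measurements
grid = [(l, s2) for l in np.logspace(-2, 0, 30) for s2 in np.logspace(-4, 0, 30)]
best_l, best_noise = max(grid, key=lambda p: log_marginal_likelihood(t, y, *p))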
OK, so I think that's all I wanted to say about Gaussian process priors. So one thing to note, in terms of, now, transitioning into Kris's second part, is that you might have realized that what Gaussian process priors do for you is they allow you to share information across time. They allow you-- so for example, here, you see that the value of the function between the third and fourth point is now actually constrained by what you've observed at other time points. And this is what we call information sharing across time.
So Gaussian processes are a very natural way of expressing these kind of constraints. And now Kris will tell you about how latent variable models, which is orthogonal to this idea of Gaussian processes-- but how latent variable models themselves allow you to capitalize on the fact that latent variables that we seek to extract are constrained not just by one neuron, but they are constrained by more than one neuron. So the question is, how can we share information across neurons to get better estimates of these latent variables? OK, so I'm going to hand it over to Kris now.
AUDIENCE: So Guillaume, I have one question, which is, you mentioned we can have several forms of the correlation kernel-- so you show the oscillatory, the Gaussian form-- how should we go about choosing the form of this kernel? Is it based on our intuition of the data?
GUILLAUME HENNEQUIN: So your intuition of the data is-- or the intuition you have about the data are a great place to start. But in fact, the recipe that I've told you about to optimize the hyperparameters can also be used to perform model selection. So you can actually try different kernels and pick the one that best explains the data. That's another sensible way of going about it. Yeah.
KRIS JENSEN: Yeah, so as Guillaume said, now he's introduced Gaussian processes. And the focus here is Gaussian process and latent variable models. So now we'll talk a little bit about latent variable models before putting them together. And we're going to keep talking within this Bayesian probabilistic latent variable modeling framework.
And so the first thing I wanted to point out here or talk about is actually a method known as factor analysis, which is just a slightly different version of PCA. And the idea is that you have some sort of latent variables, x. So here, x is a vector. So the length of x is the number of latent variables you have. It might be two or three if you want to plot your things.
And you have some latent variable at different points in time. And what factor analysis assumes is basically that at each point in time, your observations are given by your latent variable models-- sorry, by your latent variables via a linear observation model. So this arrow here is really your latent variables times some weight matrix plus Gaussian noise gives you your observation.
And what you're trying to do in factor analysis is really to capitalize on the fact that there is some correlation structure among your neurons which allows you to summarize the activity of many variables, all your neurons, using a small number of variables, your latent variables, by, essentially, doing this information sharing across neurons. And you're going to be able to learn this because you have many different samples, or in other words, many different time points, that can help you constrain this relationship between your observations and the latent variables.
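As a rough sketch of that factor analysis generative model (with made-up sizes and a standard-normal readout matrix, purely for illustration):

import numpy as np

T, N, D = 100, 50, 2                                 # time bins, neurons, latent dimensions
C = np.random.randn(N, D)                            # linear readout (loading) matrix
noise_std = 0.5

X = np.random.randn(D, T)                            # independent Gaussian latents at every time bin
Y = C @ X + noise_std * np.random.randn(N, T)        # observations: linear readout plus Gaussian noise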
What Gaussian process factor analysis says is not only are latent variables constrained by these observations, but they're also constrained by each other. And the way in which they're constrained by each other is exactly via this Gaussian process covariance function that tells us how this latent process basically evolves over time. And so just to illustrate this, we've made a Colab notebook. And this should be the first Colab notebook link.
And what we're going to do in this Colab notebook is, essentially, try to, as Guillaume said, draw data from this model to see what the resulting data looks like. And that is the data that has a high probability under our model. And what we learn when fitting our model is going to be something that looks like that generative process. And so the notebook's called something like MIT GPFA. One thing that is worth noting here is that later on, we're going to have a second notebook, which is what we're going to use to actually fit data. It's called MIT bGPFA.
And so here, we also need to download some data and install a small software package. And this might take a couple of minutes. So for those who are kind of following along at home, it's worth just running the first three code blocks of this notebook right now, because then it'll be done running by the time we get there. But with that said, I'm going to return to this first notebook here. And this first note--
GUILLAUME HENNEQUIN: Sorry, Kris. [INAUDIBLE] was saying that you do it if you want the thrill of actually going through step by step by yourself. But Kris is actually going to walk you through the whole process anyway. So you don't have to do it.
KRIS JENSEN: Yeah, and these notebooks will remain publicly available. So you can also do it tomorrow, or the day after, or when you feel the desire to think more about Gaussian process latent variable models. OK, so what this notebook is doing is really just trying to drill down this difference between what is factor analysis and what is Gaussian process factor analysis.
And so the first thing we're going to do is just load a couple of libraries and set a couple of parameters, like how long we recorded our neural population for, how many neurons we recorded, and how many latent variables we want to fit. So the first two are experimentally-determined parameters, and the number of latents is kind of like a hyperparameter. And then, because this is really about the generative model, we're going to specify the actual ground truth observation model, which is specified here by the readout matrix. This all here is basically just a plotting function, so I'm not going to talk about it.
So the first thing we're going to talk about is factor analysis. And as we said, we recorded some data, and we have some latent variables and this readout matrix here. And we are assuming a priori that there's no correlation between our latent variables. They're just drawn from some independent Gaussian distribution. So every time you record a new time point in your experiment, you draw a new latent variable. And that's kind of what we were assuming when we were fitting this model. And then our observations, y, are given by some noisy observation of a linear function of the latent variables.
And so what this looks like, if you actually draw the data, is that your latent variables here are colored according to time. So the closer the colors are, the closer in time your latent variables are. And we're plotting one latent dimension versus the second one. And so what we're seeing is kind of a salt and pepper pattern. So there's absolutely no relationship between this red guy and this red guy, and they can be-- well, they can't be infinitely far apart, because we assume some finite variance. But they're no nearer each other, on average, than this red point and this blue point.
The second thing that we've assumed is that for a given latent variable, if we look at the firing rate of one example neuron-- so that's here, plotted as a heat map, this is the firing rate of a neuron-- then that firing rate is going to be a linear function of where you are in this latent space. And finally, once you've found the firing rate, then the actual observation that you record, so your actual spike count, is going to be some noisy version of this firing rate where the noise is a Gaussian noise model. So this takes on any real value with finite probability.
So we can contrast this with Gaussian process factor analysis. And what we're assuming here-- this is basically-- all the equations are the same, so I'll not talk about them in detail. You can look at them later if you want to. But the key difference here is that, now, our latent variables are drawn from a Gaussian process like the one Guillaume talked about. And so if we try to plot this, we see that now latent states that are near each other in time, so have similar color here, are also going to be near each other in space.
So they're constrained by one another. But given this latent trajectory, we're still assuming a linear observation model. So this is the firing rate of an example neuron. It's still a linear function in this latent space, and we're still assuming Gaussian noise here. And so we can try to draw some more things from this process here. And they'll take on different shapes. These are, essentially, like the 1D draws that Guillaume showed before just in two dimensions.
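The corresponding sketch for the GPFA generative model only changes the prior over the latents; sizes and parameters are again made up for illustration:

import numpy as np

T, N, D, l = 100, 50, 2, 20.0                        # time bins, neurons, latents, GP timescale in bins
t = np.arange(T)
K = np.exp(-0.5 * (t[:, None] - t[None, :]) ** 2 / l ** 2) + 1e-6 * np.eye(T)
L_chol = np.linalg.cholesky(K)
X = (L_chol @ np.random.randn(T, D)).T               # each latent dimension is now a smooth GP draw
C = np.random.randn(N, D)                            # same linear-Gaussian readout as factor analysis
Y = C @ X + 0.5 * np.random.randn(N, T)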
And so the idea here is that because we've built this assumption into our model, then when we fit the model to the data, we'll get something out that also looks like this. And we'll show you this for real data in a little bit of time. And you can think about what happens when you have different types of parameters. Oh, my wireless keyboard just lost connection here.
GUILLAUME HENNEQUIN: And here, you can see that I didn't lie, that it's really just two lines involving a Cholesky decomposition and a matrix multiplication.
KRIS JENSEN: Yeah, this is basically the code for all of this. And so again, you can see that things can change from slower timescales to faster timescales. But the interesting thing here is that, now, we're not only drawing the latent variables, but we have this observation model on top, which is basically given by the factor analysis model. And so we can think about what this actually means in a neural space here.
So here, we have a fairly slow process. And so you see that the firing rate of each hypothetical neuron here is only going to change slowly in time. Whereas, if we have a fast process, or a fairly fast process here, then the firing rate of neurons is going to change slightly faster in time.
And so the idea here is basically that, by fitting the model, you're going to learn both how the neurons or your observations relate to these latent variables, and you're also going to learn the timescale over which the latent process evolves, which is basically going to reflect the timescale over which neural activity evolves. And if you believe that this somehow relates to behavior, this is somehow going to reflect the timescale of the behavioral process, or the decision-making process, or whatever other process the activity of the neurons is reflecting.
OK, so that was kind of the very intuitive, high level overview of factor analysis and Gaussian process factor analysis. And this is really a very nice method that was developed by Byron Yu and these guys because it allows us to do information sharing both across neurons and also across time. And given our noisy neural recordings, this is a very sensible way to get at some noise-free variable that we can use for downstream analyses.
However, these models are not all gain, no pain. There are also some caveats with them. And some of these challenges involve the fact that we compute the covariance of the marginal likelihood, which is a Gaussian whose size is the number of latent dimensions times the number of time points. And this means that it can be quite computationally expensive to fit, and in particular, that the time it takes to fit your model is going to scale cubically with the number of time points that you've recorded. Sorry, there are two different notions of time here.
And so as an example, if you imagine that you recorded 100 trials, each of which consisted of 100 time bins, and this took you 10 minutes. If instead, you recorded a single trial with 100 times 100 time bins, then this would take on the order of 70 days to fit. And so this kind of scaling with long trials is really not very good if you, for example, want to fit continuous behavior.
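To spell out the arithmetic behind that example under the cubic-scaling assumption:

\frac{(100 \times 100)^3}{100 \times 100^3} = \frac{10^{12}}{10^{8}} = 10^{4}, \qquad 10\ \text{minutes} \times 10^{4} = 10^{5}\ \text{minutes} \approx 70\ \text{days}.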
The second thing is that we're kind of assuming everything is Gaussian, and this can be a fairly poor description of neural recordings. So here in black is a heat map of the spike counts in each bin for the data that we saw on slide 2. And we see that this is some kind of discrete distribution. But what we're assuming is that it can be described as Gaussian, which also has finite probability mass for, for example, non-integer counts and negative counts. So this can also be a bit of a problem.
And the third problem is that-- and this is kind of a problem with a lot of these models-- is that they have a lot of hyperparameters. And in particular, it can be quite difficult to figure out how many dimensions you should retain in this latent space. And you might have to do some fairly expensive cross-validation to figure this out. And so something we've been working on lately, to talk a little bit about our own work now, is that we've been trying to address some of these challenges to make a slightly more neuroscience-friendly version of GPFA.
And to solve the first and the second issue, this computational scaling and the Gaussian noise, we can use another method from machine learning known as variational inference. And this is not something we'll talk about in any amount of detail, but this is really becoming very, very common to do, both in machine learning and, increasingly, in the computational neuroscience literature, which has taken it over from the machine learning literature. And so this was developed for Gaussian processes by James Hensman et al. in 2015.
And the other thing we can do to automatically pick the dimensionality, or the number of latent dimensions to retain, is to put a prior over our readout matrix, C. So now, our observation model is no longer deterministic with a parameter, C, that we learn as a maximum likelihood estimate. But instead, we treat this observation model as another, essentially, latent variable that we're trying to infer a distribution over.
And this is kind of related to what Guillaume was saying with the Gaussian processes. If you had a very wiggly line, you'd be unlikely to draw a process that passes through all your data points. Whereas, if you had a line of appropriate wiggliness or appropriate complexity, then it might pass through all your data points. And it's kind of the same thing we're doing here. We're putting a prior over C, and we're integrating it out.
And so if you have too many latent dimensions, all of which contribute to your observations, then you're unlikely to see the particular set of neural activities that you see. And if you have a model that's too simple, then it doesn't have enough flexibility to generate them either. But if you have a model of the correct complexity, then you have a high likelihood of your data. And so this is actually something that we can learn automatically on the go.
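A very rough sketch of the generative intuition behind such a prior over C. This is only meant to illustrate the automatic relevance determination idea, not the actual Bayesian GPFA implementation, and the per-dimension scales below are made up:

import numpy as np

N, D_max = 50, 10                                    # neurons, maximum number of latent dimensions
s = np.array([1.2, 0.9, 0.7] + [1e-3] * 7)           # per-dimension prior scales (illustrative)

# Each column of the readout matrix gets its own prior scale; dimensions whose
# scale is driven toward zero during learning effectively stop contributing to Y.
C = np.random.randn(N, D_max) * s[None, :]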
And so putting all this together, we came up with a method called Bayesian GPFA. And this is really just a slightly fancier version of GPFA, which we implemented in Python. And it's worth noting that for people who want to just implement standard GPFA, there are also toolboxes available for this. For example, DataHigh in MATLAB and Elephant in Python allows you to fit standard GPFA. We're just going to do Bayesian GPFA here, partly because we developed it, partly because it allows us to fit single long recordings, which we think is quite nice.
I'm just trying to find my mouse on the screen here, which is not going well at all. OK, here we go. Cool.
AUDIENCE: Can I ask a quick question?
KRIS JENSEN: Of course.
AUDIENCE: Yeah, so you were talking about the limitations for GPFA. I was wondering if you could also-- if there's any data limitations that you run into, both with normal GPFA, and then also with the Bayesian version? So for instance, how much data do you need in order to actually fit a model that's reasonable?
KRIS JENSEN: So the model that we're going to fit here-- I mean, you can fit it with a single trial, essentially. And you can get reasonable results for a single trial, I would say. Guillaume, you can disagree with me if you want to disagree with me here. I'd say, especially standard GPFA, if you have a lot of neurons, you need to learn all the parameters for this readout matrix.
And so it's a lot of parameters you need to learn from not that much data. This Bayesian version has many fewer parameters, and so you might do slightly better in the really small data setting. But I think either method, a handful of trials, of a couple tens of neurons, you'll be fine fitting it. I'll put my money on a reasonable model fit with Bayesian GPFA for a single trial if you have 50-ish neurons.
GUILLAUME HENNEQUIN: I mean, I would complement that by saying that Bayesian GPFA is probably the safest bet in the low-data regime, I guess. I don't think there's any GPFA version or even latent variable model in neuroscience that I know of that would actually do better in a low-data regime, precisely because we've eliminated most of the parameters.
It's mostly nonparametric, so I think you're on the safe end. But then, as for how well you can fit a single trial, you just do as well as you can given the amount of data you have. And maybe a single trial is just not enough information to give you good reconstructions of phase portraits of dynamics and this sort of stuff.
KRIS JENSEN: All right. Here we go.
GUILLAUME HENNEQUIN: I think the question you should ask is how much-- what's the maximum amount of data you can handle. I think this is where these methods tend to be a bit trickier. But I think this is also where Kris and Calvin have worked very hard to make the method scalable. I mean, Kris will tell you more about that. I mean, you can actually see in this notebook already--
KRIS JENSEN: We're only fitting a little bit of the data here because it has to run in a minute or two. But in theory, we can fit-- so what we really wanted to be able to do was fit freely-moving behavior, because I think-- at least I think, increasingly, that this is partly where neuroscience is moving and partly where it gets really, really interesting, when we no longer have this trial-constrained paradigm, but we really allow animals to behave together or individually or interact with their environment.
And we're not relying on this averaging across trials, we're just relying on these a priori assumptions about how neural activity evolves to constrain it instead, using these Bayesian nonparametric models. And we can fit these really long trajectories of an animal just moving around for half an hour.
So this is a monkey reaching on a screen for just half an hour. So it's really a single trial. But the single trial is 70,000 time bins. But that's the regime that we're trying to move towards. And this is the first step in that direction. And it's definitely not going to be the end all method, but we think it's a nice way of beginning to look at these problems.
Cool. So now, some equations. They're basically the same as the equations with GPFA. I'm again, not going to go through them in detail. They're just here for completeness. The key thing here is, again, that we have this prior over our observation model itself. And so now, when we compute the marginal likelihood, then we need to integrate out this readout matrix as well. And so these cells have all run. That was just installing our library.
So this is a latent variable modeling library that we've been working on. And part of that library is this Bayesian GPFA. We're also trying to do a couple of other things, which you might see in the code later. Then we just need to import a bunch of libraries. This is just some stuff that we'll use for the analyses after fitting the model. We load the data here. So this data is taken from O'Doherty et al., 2018.
I've just put a single session on Google Drive, which we're downloading here. But really, see that paper if you want the full data set. It's a really, really nice data set. And then we're going to subsample, so we're only fitting about a minute of data-- 40 seconds of data. And we're only considering some of the neurons. This is just so it can run live. So just to see the data that we're going to fit, this is also what you saw on slide 2. So we're fitting, roughly, 70 neurons, a couple of thousand time bins.
And this is basically just a monkey reaching on a screen. I'll plot the behavior a bit later on. We're going to set some parameters for the optimization process. I won't talk about this in detail, but this is things like the maximum number of dimensions that we'll allow the model to fit. Even though we don't pick a number of latent dimensions, we need to have a maximum just to initialize all the variables and stuff. And so in theory, this could be the number of neurons you have, which is an upper limit. It would just take a bit longer to fit, so we start with something lower.
Then we construct the actual model. And so this is where we've written this latent variable model where you can specify the observation model here. So this is where we say we don't necessarily want to use a Gaussian noise model for discrete spike counts. So here, we use a negative binomial model, which is kind of a generalized version of the Poisson model. We specified that our latent variables live in Euclidean space.
This is related to some of our previous work where we tried to develop these latent variable models where the latent variables can, for example, live on a circle, which might be nice if you're fitting head direction circuits or thinking about all that kind of rotational dynamics. And then we're specifying that it's a GP prior over the latent states, in contrast to, for example, these linear dynamical systems priors that other people use. And then we initialize the model.
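Just to illustrate the kind of spike-count likelihood being swapped in for the Gaussian one here, a generic sketch using scipy (not the library's own code; the counts, rate, and dispersion are made up):

import numpy as np
from scipy.stats import nbinom

def nb_log_likelihood(counts, rate, dispersion):
    # negative binomial log likelihood of integer spike counts with mean `rate`;
    # as the dispersion grows large, this approaches a Poisson model
    p = dispersion / (dispersion + rate)
    return nbinom.logpmf(counts, dispersion, p).sum()

counts = np.array([0, 2, 1, 4, 0, 3])                # hypothetical spike counts in six bins
print(nb_log_likelihood(counts, rate=1.5, dispersion=2.0))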
And having specified the model, we can start training it. So we see that we're fitting-- oh, actually, 93 neurons for 1,500 time bins. And so this shouldn't take too long to fit. And while it fits, we'll just see here that what it plots is basically the prior scale for each latent dimension. So we said we put a prior over this observation model. And this prior has a single parameter for each dimension, which is basically how important this dimension is.
And that's the flexibility that allows the model to discard dimensions that it doesn't need. And so basically, as the optimization process proceeds, it will try to explain all the data that we've recorded. It will basically try to explain this raster plot up here using the latent variables. And then when it realizes that, actually, the representation it's learning is a bit lower dimensional, then it doesn't need a latent variable anymore. And so we can learn that it's not necessary, and then our marginal likelihood ends up getting higher because we're no longer integrating over the noise injected by that latent dimension.
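So in that automatic-relevance-determination spirit, after training you can simply look at the learned per-dimension scales and drop the ones that have collapsed towards zero-- schematically, and with made-up numbers:

import numpy as np

# Hypothetical learned prior scales, one per latent dimension
s = np.array([0.9, 0.02, 0.6, 0.001, 0.4, 0.003, 0.5, 0.002, 0.3, 0.001])

threshold = 0.05                                # arbitrary cut-off for this illustration
kept = np.flatnonzero(s > threshold)
print(f"retained {len(kept)} of {len(s)} dimensions:", kept)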
AUDIENCE: Quick question-- can you say a little bit more about the manifolds, the Euclidean-- that line? You went really fast.
KRIS JENSEN: Yes.
AUDIENCE: I want to hear a bit about it.
KRIS JENSEN: Yeah, sorry. So this is basically because this is kind of a remnant of a different part of this repository. So any latent variable model that you're going to be familiar with implicitly assumes that the latent variables live in Euclidean space. You might plot them in 2D space or in 3D space, but we always plot them on an axis, which is basically the real numbers.
But if you imagine you're recording from something like a head direction circuit-- this could be the ellipsoid body in drosophila-- then you think that it's actually representing a variable that doesn't live on the line. Head direction doesn't live on a line, it really lives on a ring. It takes values from 0 to 2 pi. And the value 0 should have the same representation as the value 2 pi. You can also think of this as building in a prior that these things should be the same. And so it's not super trivial to build a model that has the latent variables living in that space, for example, because linear functions aren't really well-defined in this space.
But you can define a Gaussian process that maps from this latent space, the circle, to the space of observations, basically by having a covariance function whose arguments are not real numbers, as they were in the example Guillaume showed, but angles on the circle. And the covariance function says that if the angles are close by some distance-- for example, a cosine distance-- then the output should be similar.
And so this is kind of a generalization of the Gaussian process latent variable model by Anqi Wu where not only is the prior over the latent variables a Gaussian process, but the actual observation model that maps from the latent variables to the observations is also a Gaussian process, but where we allow that Gaussian process to live not in a Euclidean space, but on a sphere, or on a circle, or on a torus, or something like that.
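One concrete way to write down a covariance function on the circle is the standard periodic (exp-sine-squared) kernel, which depends on angles only through how close they are on the ring-- equivalently, through the cosine of their difference. This is just an illustration of the idea, not necessarily the exact kernel used in the library:

import numpy as np

def circular_kernel(theta1, theta2, lengthscale=0.5, variance=1.0):
    # Valid covariance on the circle: note 2*sin^2(d/2) = 1 - cos(d)
    d = theta1[:, None] - theta2[None, :]
    return variance * np.exp(-2.0 * np.sin(d / 2.0) ** 2 / lengthscale ** 2)

thetas = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
K = circular_kernel(thetas, thetas)
# The first and last angles are neighbours on the ring, so their covariance equals that
# of the first two angles, even though they are far apart as real numbers:
print(K[0, -1], K[0, 1])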
And we think this might be useful for certain purposes. So we've developed a model for doing that, and that's included in this library. But I would not recommend running it right now because we've updated a lot of the code, so the example scripts may not run at the moment-- but hopefully, they will again soon. This Bayesian GPFA branch is pretty stable, though, if you guys want to run it. Anyways--
AUDIENCE: Just to clarify, so the one you use now is the Euclidean or--
KRIS JENSEN: Yeah. Yeah, so this is just Euclidean here. Yeah. So here, we just say the manifold should be Euclidean. And basically, this variational distribution, which is a Gaussian process, won't work if the manifold isn't Euclidean, because you can't actually specify Gaussian processes with an output on the circle, at least not easily.
AUDIENCE: I get it.
KRIS JENSEN: Cool. So in the intervening time, our model has finished fitting. So this took-- it took a little bit longer. So basically, it depends on how busy the Google servers are at the moment since we're using Colab. So we can then think about how many dimensions did our model actually retain.
We said that we were going to learn this. And so here, we basically plot how important is each latent dimension. And along the other axis, we basically plot how-- what the average value of the corresponding latent variable is. And so these are going to be heavily correlated. I just plotted them against each other because we get a 2D plot, which is nicer to look at.
And we see that there are five latent dimensions that were found to contribute significantly to the model-- that basically have non-zero values for these two parameters. And the rest were discarded and found not to be necessary to explain the data. And these retained dimensions also carry variable amounts of information about our actual observations. So we can rank them similarly to how you could rank PCs by variance explained. This is basically variance explained along the x-axis.
So we can then take the most informative dimensions, and we can plot them and say, what do these actually look like? And we see that, indeed, they do form some kind of smooth trajectory. And so here, we contrast this with if we fitted factor analysis to the same data set, this kind of single-trial data set. And factor analysis gives you a bit of a mess because it doesn't allow this information sharing across time. And each bin only has a couple of spikes per neuron. And so it's just not as constrained when you don't have loads and loads of trials to average over.
And so I also printed the learned timescales. So this is basically the hyperparameter of the Gaussian process that Guillaume talked about. And we see here that the two dimensions we are plotting, which are the two most informative dimensions, are fairly fast dimensions in the sense that their autocorrelation time is roughly 100 milliseconds. So this is on the order of motor preparation or motor initiation.
We also learned some longer-timescale latent processes. It might be fun to plot these just to see what they would look like. And it won't look any different from the previous notebook, where we just noted that longer-timescale things evolve over longer timescales. So you can also think about what's the meaning of these fast processes, what's the meaning of these slow processes, and how they might relate to fast or slow behavioral features that we're interested in.
So in this case, we might be interested in actual hand kinematics because we're looking at a reaching task. And so here, I'm plotting the hand trajectory for the same amount of time. And so we see that this is basically corresponding to a single reaching trial. So we're really almost fitting a single trial here. And we can also plot the velocity. And so clearly, there's some kind of behavior going on, and we have some kind of latent processes going on. And we might, at least as a sanity check, try to ask, well, are our latent processes actually related to the behavior?
And the simplest way to do this is basically by doing some kind of decoding analysis. And we do find that, actually, we can decode kinematics from these latent processes fairly reliably. Again, see the notebook for details later on. And so here, the x-axis is basically the delay between the latent process and the behavior. And what we find is that, actually, the neural activity of this latent process predates the behavior that it predicts most strongly.
And this is kind of related to this dynamical systems view of motor control or this preparatory activity of motor control. And so we do a lot more of these kinds of analyses in our recent paper, but what we wanted to give here was just a gist of how we can fit these models, how they're different from factor analysis, why it might be something we want to do, and how we can relate it to behaviorally observed or experimentally observed variables that we care about as neuroscientists.
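For concreteness, the lagged decoding analysis is essentially the following (with hypothetical stand-in arrays; the notebook's actual implementation will differ in its details):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Stand-ins for the notebook's outputs: latents [T, D] and hand velocity [T, 2]
T, D = 1500, 5
rng = np.random.default_rng(0)
latents = rng.standard_normal((T, D))
velocity = rng.standard_normal((T, 2))

def lagged_r2(lag):
    # Cross-validated R^2 of predicting velocity from latents shifted by `lag` bins.
    # A positive lag means the neural latents precede the behavior they predict.
    if lag >= 0:
        X, y = latents[:T - lag], velocity[lag:]
    else:
        X, y = latents[-lag:], velocity[:T + lag]
    return cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2").mean()

lags = range(-10, 11)
scores = [lagged_r2(l) for l in lags]
print("best lag (bins):", max(zip(scores, lags))[1])

With real latents and kinematics, the peak of this curve sitting at a positive lag is what is meant by the latent process predating the behavior it predicts.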
And so that's the end of this section. And the next thing we'll talk about is this GPFADS model, which is GPFA on steroids. And before then, I just want to see if anyone has questions.
AUDIENCE: How do you deal with modeling many neurons for the same task across many different subjects, rather than just within-- I think, just within a single subject? So for instance, there might be some subject variation that you want to compare to within-subject variation.
KRIS JENSEN: Right. So I think, as far as I'm aware-- Guillaume can correct me if I'm wrong-- I'm not aware of many people who fit these kind of latent variable models jointly across multiple subjects. You would have to make some a priori assumption about what is it that's shared across subjects and what is it that's idiosyncratic to each subject.
What you can, of course, do is just fit them all to each subject, and then do analyses like this decoding thing, or things like that, and try to see how the dynamics, or statistics, or timescales of your processes vary across subjects. In terms of fitting a single model across subjects-- I don't know, Guillaume, do you know of people doing this off the top of your head?
GUILLAUME HENNEQUIN: I don't know of people doing this in [? GPFA-land. ?] But I could imagine, for example, that it would be pretty easy to tie, for example, the kernel parameters, like the length scales and stuff, to just have one set that's representative of a pool of monkeys or a pool of mice, or whatever, especially with your variational inference formulation, I think.
KRIS JENSEN: Yeah, so if you want to fit multiple things [INAUDIBLE], you basically decide on which parameters are shared across the population.
GUILLAUME HENNEQUIN: Exactly, yeah.
KRIS JENSEN: And then you would add all of the [? log ?] likelihoods across your entire population. And then you would take gradients with respect to that during your optimization process. And that's how you would learn it. If you were doing something like LFADS, which we are not at all doing here, but where you really imagine that your latent process is a fully specified nonlinear dynamical system, you could make the assumption that that entire dynamical system is shared across subjects. And you can compare that to fitting models within subjects and try to say something insightful about whether it is, in fact, the same or it is, in fact, different. But I think that's less trivial with these Gaussian process models.
AUDIENCE: Yeah, I guess I'll add a question to this because I think it's interesting, which is, would you be able to just somehow look and see if you get latent factors that are subject-specific versus things that do vary across all of the participants, like if you just did throw everything together? Or do I have the wrong idea?
KRIS JENSEN: So one thing we do is we-- so basically, you can make different modeling assumptions. Again, this is all about the assumptions that go into your model. It's really the learning what we know. So we specify what we know, and then we see what comes out when we do the learning. And so we don't make an assumption that the latent process is the-- even if we were to fit multiple trials, we wouldn't make the assumption that the latent process is the same on all trials.
However, if you did do that, then the latent state, you could assume that that would be shared across subjects as well. And because the latent state is shared, then you could see whether you need more latent dimensions for more subjects or whether they somehow share the latent space. I think that could make sense to do. But I think that would require you to tie the latent trajectories across trials and across subjects. Guillaume looks like he might not agree with me that this is even possible.
GUILLAUME HENNEQUIN: I don't know, I think-- I mean, a space to watch, maybe, in this direction is what's going on these days with modeling of multiple areas. Because these people have the same type of problem. I know Byron Yu has a paper, for example, where it's about modeling what's shared across two areas and what's private.
KRIS JENSEN: We also have a paper that talks about this.
GUILLAUME HENNEQUIN: So you could imagine also-- yeah. Say again?
KRIS JENSEN: I'm saying, we also have a paper that talks about this, which is exactly this paper.
GUILLAUME HENNEQUIN: But in terms of-- you could imagine also doing a-- just like Kris said, you would need to explicitly write down another, modified GPFA model that explicitly fleshes out which aspects of the latent you believe are shared. So you would have a multi-stage latent: something that's shared across animals and something that's private.
And I think there, the Bayesian methodology actually allows for a pretty straightforward conceptualization of these ideas. I'm not aware that it's been done, but I don't see that there would be a big problem, at least conceptually, in writing it down. Yeah.
KRIS JENSEN: And I guess, in these days of automatic differentiation and machine learning toolboxes, the distance from it conceptually making sense to actually be able to write it down is really, really very short. Because as long as you can write down the marginal likelihood and compute it, then you can do the optimization.
GUILLAUME HENNEQUIN: And when you have great PhD students like Kris, the distance to an actual implementation is even smaller.
KRIS JENSEN: Cool. Any other questions about this data fitting part of it? Otherwise, we'll move on to the last part, which is Guillaume talking about this GPFADS model.
GUILLAUME HENNEQUIN: I'm planning on just 15 minutes on this last part, which is about non-reversible Gaussian processes. And it's basically going to be a summary of this paper by Virginia Rutten, Maneesh Sahani, and Alberto Bernacchia, and myself. And it's addressing the following question. So what Kris has shown you is that you can use GPFA to extract latent processes or latent trajectories from population data.
But the way that the field these days tends to think about these latent processes is that they should reflect some sort of dynamical process. And wouldn't it be nice if on top of extracting these latent processes, we also, along the way, learned about the dynamical system that underlies them. And by dynamical system, I mean something that-- like a flow field, like x dot equals f of x or some function f.
Maybe, it might not be just as autonomous as what I've written now. It might be input-driven. But can we extract not only those latent processes, but also the dynamics, the flow field that governs their temporal evolution. So that would be something great to have. And indeed, this is something that a lot of methods which Kris has already mentioned actually tackle, things like, even, linear dynamical systems like Kalman filters, PfLDS, LFADS. And also, in [? Gaussian-process-land, ?] there's also methods for [INAUDIBLE] GP state space model that place Gaussian process priors on the function f itself.
There's nothing wrong with these methods. I think these methods are great in their domain, especially for these problems. But I think, in doing that, we're losing the nice Bayesian non-parametric approach that we have in GPFA. You can see here that in the notebook that Kris just walked you through, learning was actually pretty smooth. We know what sort of learning curve to expect in GPFA. You get a pretty smooth and well-behaved decrease of the loss function.
And these things are pretty well understood. The optimization is pretty robust. And you don't need that much data. So they are very data efficient in contrast to these things. So I think there is room for a method that inherits some of these nice properties of GPFA but also brings it closer to the realm of dynamical system, which is really underlying most of our thinking in systems neuroscience at the level of population data these days.
So what's the problem with GPFA? Why is GPFA really not giving you that for free? Well, the main thing that jumps out at me, at least, is that if you look at the GPFA prior and you draw some samples, it looks something like this. Kris has shown you other wiggles, and you can draw them for yourselves in the notebook. But basically, you get something like this.
And if you think about it for a second, these trajectories-- so your prior assumptions, really, about what these latent processes are in GPFA-- violate an absolutely key property of autonomous dynamical systems, which is temporal non-reversibility. So these trajectories here on the left, which were drawn from, I think, the Duffing oscillator flow field-- well, basically, they obey a flow field, which means that if you are at a given position, let's say, at the top left of that yellow curve, you've got a derivative that's all determined for you.
And so the trajectory can only move one way-- in this case, to the right. Whereas these Gaussian process priors that we've used in GPFA, they don't do that. So for example, this gray one goes back onto itself in a way that suggests strong reversibility, a perfect reversibility of the dynamics. So it's equally likely to generate these trajectories in one direction as it is to generate them in reverse. And mathematically, this is very easy to show. There's no temporal ordering between these latents that would break this reversibility.
So what we set out to do was to cook up a new Gaussian process prior-- a new family of priors, really-- that expresses exactly this property. So I'm going to walk you through our construction, but suffice it to say, for now, that these samples that I'm drawing here are samples from our new non-reversible prior that have zero probability of having occurred in reverse.
So we've drawn them from a particular prior that I'm going to describe soon. And you can take my word for it at the moment that they would never have been generated in reverse. The prior is, again, a probability distribution over these trajectories, and it places zero probability mass on the reversed ones.
So how do we do that? Well, first, big picture, the hope is that if we can cook up these non-reversible priors and incorporate them just as drop-in replacements to the standard priors in GPFA, we would have a method that would bridge these dynamical systems methods and these nonparametric GP-based methods like GPFA.
OK, so how do we-- where do we start? So the first thing I need to tell you is how we quantify reversibility in Gaussian processes, or in spatiotemporal time series for that matter. So Gaussian processes, as I told you in the GP primer, are fully determined by their mean function and their covariance function. So we're going to have to look for a measure of reversibility that is based on covariances.
And so we do it as follows. I'm just going to walk you through the empirical calculation of this non-reversibility measure on a bunch of samples. Imagine you're recording N neurons over T time bins, and you've got a bunch of trials. You could, for example, chunk a long recording into trials.
And then what you can do is you can build the space-time covariance matrix, which is this part here, which is really made-- so you have N neurons and T time bins. So this is going to be made of a matrix of N by N blocks where each block is a T by T matrix. So for example, here that block at the top left, C1 1, is the temporal covariance of the first unit with itself. So this is, for example-- I showed you one example of that in my banded yellow blue matrices that I showed earlier.
So this is just a temporal covariance that specifies the autocovariance of unit 1 of neuron 1. And this one, for example, would be how neuron 1 covaries-- neuron 1 at time T covaries with neuron 2 at time T prime. So this is the cross-covariance between the two neurons. And so by building all possible covariances between all possible time points and all possible pairs of time points and pairs of neurons, you get this big space-time covariance matrix C.
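Written out (in our notation, not necessarily the paper's), the object being built is

C = \begin{pmatrix} C_{11} & C_{12} & \cdots & C_{1N} \\ C_{21} & C_{22} & \cdots & C_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ C_{N1} & C_{N2} & \cdots & C_{NN} \end{pmatrix}, \qquad \big(C_{ij}\big)_{t t'} = \operatorname{Cov}\!\big(y_i(t),\, y_j(t')\big),

where each block C_{ij} is a T by T matrix, so C itself is NT by NT.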
And what we do is consider this space-time covariance matrix minus a version of it where you've transposed all the T by T blocks. So C1 2 becomes C1 2 transpose, and C2 1-- which is C1 2 transpose-- becomes C1 2. Here, there's this symmetry between this block and that block: they have to be the transpose of each other, just because it's a covariance.
And so we transpose these T by T blocks, and we take the difference. And we get something that we called C dif, for difference, which has this structure. We know that these diagonal blocks, C1 1, C2 2, et cetera, are symmetric. So if we subtract their transposes, we get 0. So this matrix C dif is going to have blocks of 0s along the diagonal. And then it's got these patterns in the off-diagonal regions.
So why is that interesting? Because this actually captures the odd components of the cross-correlations between units. So if there is a systematic lag in these samples, in these activity trials-- if there is a systematic lag between neuron 1 and neuron 2 in one direction, for example-- then it will show up as a slightly offset peak in the cross-correlogram between neuron 1 and neuron 2.
And if you remove the even component, which is basically this operation that we performed here, if you remove the even component of that cross correlogram function, you get the thing that identifies this systematic lag. So long story short, we defined this so-called reversibility index zeta, which is the sum-- basically, the matrix norm, the Frobenius norm of the C dif matrix divided by that of C plus.
And C plus is defined in an equivalent way, where instead of having a minus here, we have a plus. So it's basically all the even parts of the cross-correlograms. So it's a very simple construction, and it gives a reversibility index which Alberto proved to be anywhere between 0 and 1. It's bounded by 0 from below and by 1 from above.
So if zeta is 0, then you've got a perfectly reversible process where there's no systematic ordering between neurons. All pairs of neurons have a cross-correlogram that peaks at 0. And if it's 1, then typically, what tends to happen is you've got sequences of activity that repeat themselves across neurons. It's a measure of sequentiality, or at least, it's a measure of second-order non-reversibility.
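As a sketch, the empirical calculation just described fits in a few lines of numpy; this follows the recipe from the talk, and details such as whether the norms are squared may differ from the paper:

import numpy as np

def reversibility_index(trials):
    # trials: array of shape [n_trials, N, T] (neurons x time bins, per trial)
    n_trials, N, T = trials.shape
    V = trials.reshape(n_trials, N * T)                 # each trial as N blocks of length T
    V = V - V.mean(axis=0, keepdims=True)
    C = V.T @ V / n_trials                              # space-time covariance, (N*T) x (N*T)

    # Transpose every T x T block in place: C_ij -> C_ij^T
    Ct = C.reshape(N, T, N, T).transpose(0, 3, 2, 1).reshape(N * T, N * T)

    C_dif, C_plus = C - Ct, C + Ct                      # odd and even parts of the cross-covariances
    return np.linalg.norm(C_dif) / np.linalg.norm(C_plus)   # ratio of Frobenius norms

# Toy example: neuron 2 is a lagged copy of neuron 1, so there is a systematic ordering
rng = np.random.default_rng(1)
n_trials, T, lag = 500, 50, 3
base = np.cumsum(rng.standard_normal((n_trials, T + lag)), axis=1)
X = np.stack([base[:, lag:], base[:, :T]], axis=1)      # shape [n_trials, 2, T]
print(reversibility_index(X))                           # clearly above 0 because of the lag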
OK, I'm not going to comment on that. So the question is, how do we construct Gaussian processes that have this property of non-reversibility? And I'm just going to very briefly outline our construction, and then I'll show you examples and one application of GPFADS. Actually, I have five minutes, so maybe I should even skip this. This is the most technical part, so let me just go over it very quickly.
This is the main technical contribution that we brought here, which is a constructive way of building these non-reversible Gaussian process covariance functions. So the puzzle we had here was how to add these required off-diagonal blocks in such a way that you still have a valid covariance function-- which has to respect certain mathematical properties-- while ending up with a purely non-reversible process.
So the answer-- again, don't stare too long at this equation, because I'm not going to explain it-- but the answer relies on what is known as the Hilbert transform of the base kernel that we take. So we choose a base kernel, for example, the squared exponential kernel, which is going to be the covariance of-- so here, we're talking about two latent processes.
And we're trying to construct a joint kernel for these two latent processes. And each of x1 and x2 is going to have a marginal covariance that is going to be given by this base kernel. And then we break reversibility between these two latents by introducing this Hilbert transform here and the symplectic matrix.
Don't worry if you don't understand that. I'm probably not going to have time to explain it in too much depth. But the bottom line is that this construction results in a valid covariance that breaks reversibility. And you've got a free parameter here, alpha, that allows you to go all the way from a reversibility index of 0, which means a fully reversible process, to 1, which means fully non-reversible.
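For those who do want to stare at it: as we understand the construction (see the paper for the precise statement and conditions), the joint covariance for a pair of latents is roughly

K(t, t') \;=\; \begin{pmatrix} k(\tau) & \alpha\, k_H(\tau) \\ -\alpha\, k_H(\tau) & k(\tau) \end{pmatrix} \;=\; k(\tau)\, I_2 \;+\; \alpha\, k_H(\tau) \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}, \qquad \tau = t - t',

where k is the chosen base (marginal) kernel, k_H denotes its Hilbert transform, the antisymmetric matrix is the 2 by 2 symplectic matrix, and alpha interpolates between a fully reversible pair of latents (alpha = 0) and a fully non-reversible one (|alpha| = 1).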
OK, so here are some example draws for different types of marginal kernels-- in fact, the types I've shown you earlier. The squared exponential-- so you can see that it tends to curl a bit on itself. And in particular, note that it never really turns back on itself. You can see that it's really non-reversible. Then this is for the cosine kernel, which is basically a great kernel if you want to generalize JPCA.
For those of you who know JPCA, this is what you would use if you wanted a probabilistic version of JPCA. So these generate pure rotations of different radii, where the radius varies across samples. Then there's the spectral mixture kernel, which is a bit of a mixture of both and has more free parameters. So you can see that we can actually adapt this construction to a variety of kernels. And I posted a link to the Kernel Cookbook by David Duvenaud in the chat for those of you who are interested in learning more about kernels in general.
OK, so what does that buy us? Very quickly, what it buys us is that it captures a particular type of coupling between these two latents, such that observing the first latent now tells us more about the second latent than it would if we had decoupled them completely in the prior.
So this is a very simple example where I drew from a non-reversible model. So this is this little wiggle on the right shown in the x1, x2 plane. And I'm showing here on the left x1 and x2 separately. And say we are observing x1 at four data points, and we're observing x2 only at that last data point here on the right. So then we can cook up a posterior distribution assuming zero-- assuming pure reversibility. So zeta is 0.
You can see that we do a decent job at fitting the blue dots-- because we have many of these dots, we do a reasonable job here. But we do a pretty poor, even miserable, job, I would say, at fitting the red curve, the red latent. The ground truth red latent is the dashed line. So we don't even capture the kind of wiggles that we see in the ground truth.
Now, if we put in the notion that it should be non-reversible, then we can see that observing the blue process at those points actually very strongly constrains the temporal fluctuations of the red process, and we do a much better job.
So this is not true only for data that we generated from a non-reversible GP model in the first place. That would be a bit of a cheat-- I mean, it's no wonder that it works. But we find empirically that non-reversible Gaussian processes are much better models of data that are generated from dynamical systems that obey equations of the form x dot equals f of x. And you can look at the paper. We provide a few examples of these, a few different types of oscillators.
So now, I'm going to show you one-- the culminating point of non-reversible kernels in the context of GPFA. We use them as a drop-in replacement to the standard kernels, and we get something we call GPFADS, or GPFA with dynamical structure. And I'm just going to show you one application very quickly to a recording of an M1 population shared by Mark Churchland, and Matt Kaufman, and Krishna Shenoy.
It consists of 218 neurons. Actually, the data set is freely available on their website, where they ship the JPCA package. So you've got 218 neurons, 108 reaching conditions. So this is a monkey making arm reaching movements towards different targets, around different barriers, and so on.
So you've got 108 different types of movements, and each of them was repeated by the monkey over several trials. And what they've done here for us, they've pre-averaged the data across trials. So you've got 108 trial averaged means, 108 trajectories in this 218 dimensional space.
OK, and they're also aligned to movement onset. So all of the trajectories that I'm going to show you next start roughly at movement onset-- actually, just shortly before movement onset-- but all aligned temporally to movement onset across trials. And what I do here is use GPFA with a non-reversible kernel that is basically a collection of these planar kernels that I've just shown you on the previous slide.
So I'm just going to assume that I've got several pairs of latent variables-- latent processes that come in pairs-- and that each pair is potentially a non-reversible GP. And we're going to learn these free parameters that govern the non-reversibility index. And all the rest is very similar to GPFA: we also learn the mixing matrix, and we learn the noise variances in the outputs, all to maximize the probability that our model could have generated this data set.
So here's what happens when we only consider a single latent plane. If you consider a single latent plane, you get these wiggles. So here, time is color-coded, and you've got these 108 conditions. And you can see how they unfold in a pretty consistent rotatory manner across the different conditions. And here, the non-reversibility parameter, alpha, ends up at 0.72, which indicates that even if you were to retain just a single latent plane, you find that things are pretty non-reversible already.
But the thing is, actually, if you just fit a plain GPFA model, you are not going to learn about the fact that it's non-reversible, but your latents are actually going to look nearly identical to that. So in other words, in GPFADS, because of the structure of the marginal likelihood, the first thing that it tries to do is capture variance in your data. There's nothing new in terms of what GPFA tries to achieve. It just tries to account for the data.
And the dominant term in the likelihood is how much variance you capture, actually. So GPFA alone would just give the exact same answer here. But what's more interesting is what you get if you give the model more room. So now we have two latent planes, so four latent variables that are partitioned into two planes. And you can see now that it finds a plane that's even more non-reversible than the first one and then a second plane that's a bit less non-reversible that captures the remaining variance.
And you can see that this is actually a better model. So we can do model comparison by looking at the marginal likelihood difference between the non-reversible model and the vanilla GPFA model that Kris told you about. And you can see that here, with two planes, we already do consistently better-- so the different dots are different splits of the data into training and validation sets.
And these are marginal likelihood differences evaluated on the test set. And you can see that by and large, we do better already with-- by allowing non-reversibility. And finally, if you give the model even more room, you can find, now, two very strongly non-reversible planes and one that's even more non-reversible than was the second in the M equals 4 condition.
So what GPFA would do here is it would just return a latent space with no reason to actually partition the latents into these very neat planes with very clear dynamical structure. It would just mix them up in such a way that they would be difficult to disentangle. So in the same way that GPFA does simultaneous dimensionality reduction and smoothing of your data, GPFADS does simultaneous dimensionality reduction, smoothing, and a clean separation of the dynamical structure in the latent processes. So that's basically the take-home message of GPFADS.
So GPFADS seeks to capture variance, as does GPFA, but it's going to partition latent processes into planes where the non-reversibility is maximized, because that just results in a better model. You can think of this, actually-- for those of you who know JPCA, I think of this as a generalization of JPCA, where JPCA says, I'm going to be looking for latent processes that are rotations.
But rotations are just one example of something that's non-reversible. And so our method is a bit more lax about what it means to have dynamical structure. Instead of requiring pure rotations, we just require a non-reversible statistical structure.
There's a bunch of open problems that I'm not going to have time to touch too much upon. But the major problem that we have in this method that is up for grabs is how to really generalize beyond latent planes. So here, we know how to construct these non-reversible processes for planar processes, but we don't really know how to construct higher dimensional priors in more than two dimensions that would allow us to max out the non-reversibility index. And the split-plane construction that I've highlighted here is suboptimal in many ways. So there's a bunch of open problems.
OK, I'm going to summarize with this. I hope we're pretty much on time. So in summary, what we've shown you here is that latent variable models allow you to use the Bayesian machinery to extract latent processes from your neural population data. Actually, they don't even need to be neural population data. Historically, we have mostly dealt with population data from the brain, but you could envisage applying that to many different types of multivariate time series.
I hope we've convinced you that Gaussian processes are conceptually simple. Thinking in terms of covariances is a very conceptually straightforward way of expressing your prior assumptions about what you expect these latent processes to look like a priori. Kris showed you a version of GPFA, which is fully Bayesian, which is flexible in that you can incorporate count models.
These non-conjugate observation models usually lead to very severe computational intractability, which Kris was able to mitigate using variational inference. And he's got a fast version of that method that's coded up on GPUs and that's, I think, really ready for you to experiment with. And then, finally, I've shown you GPFADS, which we developed to bridge this dynamical systems perspective with the family of non-parametric methods that were the focus of this tutorial. And I think we're done.