Continuous-time deconvolutional regression: A method for studying continuous dynamics in naturalistic data
Date Posted:
February 28, 2022
Date Recorded:
February 28, 2022
Speaker(s):
Cory Shain, MIT
Description:
Abstract: Naturalistic experiments are of growing interest to neuroscientists and cognitive scientists. Naturalistic data can be hard to analyze because critical events can occur at irregular intervals, and measured responses to those events can overlap and interact in complex ways. For example, words come quickly enough during naturalistic reading and listening that the brain responses to words likely overlap in time, and inherent variability in word durations can make these responses difficult to identify from data. In this tutorial, I will present continuous-time deconvolutional regression (CDR), a new approach to analyzing naturalistic time series. CDR uses machine learning to estimate impulse response functions from data, but, unlike established methods like finite impulse response modeling, these functions are defined in continuous time. CDR can therefore directly estimate event-related responses in a range of naturalistic experiment types, including fMRI, EEG/MEG, and behavioral measures. The tutorial will demonstrate how to define, fit, and evaluate CDR models, how to test hypotheses in the CDR framework, how to visualize patterns with CDR, and how CDR can be used to relax a range of assumptions about time series data. These steps can be run from the command line using an open-source Python library, with no need for users to write any code.
PRESENTER: Hello, everyone. Thank you for coming to the first of the BCS Computational Tutorial Series. Cory Shain is a post-doc in the Ev Fedorenko Lab. He's studying language in the brain. And throughout his research, he's been developing new methods for time series analysis.
So today, he's going to be talking to us about that. And there will be a lab portion at the end of it. So make sure to download the data. You can come to us-- Greta or me-- if you're having issues with that.
CORY SHAIN: Thank you. Thanks, everybody, for coming. Yeah, just to clarify what to download, because it would be useful to get this going while I give the conceptual overview of the method. You can focus in the Google Doc, assuming you have access to it, on the section that says installation.
So you just basically can-- this command here, you can copy paste into your terminal window and run it in order to install the software. And then the sample data and models are at the link under Sample Data and Models. Thanks again for coming.
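For reference, here is a minimal sketch of that installation step, assuming the software is installed from its public GitHub repository (the repository URL and installation route are assumptions on my part; the authoritative command is the one in the tutorial doc and the repository README):

# Assumed installation route; defer to the tutorial doc / README for the official command.
pip install git+https://github.com/coryshain/cdr.git
# Or, from a local clone:
# git clone https://github.com/coryshain/cdr.git
# cd cdr && pip install .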
So today, I'm going to talk about a method that I've been working on developing to help us understand the dynamics of time series, such as ones that we often need to analyze in brain and cognitive sciences. And so as I said, I'm going to give a conceptual overview of the approach and what kinds of problems it's designed to solve.
And then we'll dive right into applied problems and how you might use this to actually analyze your own data. But the approach I'm going to be advocating today is called-- I'm calling it Continuous-Time Deconvolutional Regression, or CDR.
So in this department, we want to use experiments to understand the mind. But the mind is a complex thing. And it has complex dynamics. So in my area, which is language, a lot of experiments will present participants with words or sentences in isolation that have been carefully constructed by the experimenters in order to test a hypothesis of interest.
But the phenomenon that we're trying to generalize to from those designs is this-- what's the mind doing in these kinds of situations, the kind of conversational setting for which human language processing is adapted? Comprehending multi-participant conversations or dialogues in a noisy environment, planning our own linguistic responses, and then integrating the information that we receive through the language medium into our broader conceptual model of the world.
So I think that there's growing interest in the language sciences in studying language in the wild in this kind of setting. And I think that growing interest is also true of a range of subfields in cognitive and brain sciences, where we want to know what brains and minds are doing in more naturalistic settings. So I think there's this complementary and growing area of research that looks at settings like this.
But to make that shift from careful control of the stimuli to a more naturalistic design, we end up kind of shifting the burden of experimental effort from designing the stimuli to designing the analysis. We don't have control over what kind of stimulation participants are getting. We only have control over how we're trying to make inferences about their mental patterns on the basis of the data that we get in the experiment.
And so we need good implementation-- coding-- of the critical variables that we want to examine in the naturalistic setting. We also typically need strong controls for the kinds of confounds that we want to rule out or parcel variance away from. And then, we need a good statistical model that's going to allow us to make good inferences about the underlying processes. And it's on that last point about the statistical model that I'm going to focus today.
So in practice, I think a lot of times in naturalistic designs, we end up analyzing data using models that make pretty strong simplifying assumptions. And this can have some consequences for the results of our analysis. So models with excessively simplistic assumptions-- that is, assumptions that don't align well with the true, underlying cognitive processes-- can lead to misleading conclusions, because our results may be affected by modeling artifacts due to poor fit between the model and actual brains.
And they can also limit the scope of hypotheses that we can ask questions about. So for example, if we're assuming that a particular effect has a particular time course, then we can no longer ask questions about that time course. And then finally, the simplistic models can impede pattern discovery.
So we can end up short-circuiting our capacity to discover new patterns in the data that we wouldn't have seen if we used a less assumption-laden approach. So today, I'm going to be focusing on, I think, four key assumptions that are common in this domain in some combination and then how we can relax some or all of them using the approach that I'm advocating.
So the first assumption is what I'll call simple response functions, where by a response function, I mean the function assumed by the statistical model to relate an event in the world or in the experiment to the response that it will evoke in our dependent measure-- in the thing that we're acquiring through the experiment, whether that's a behavioral reaction time or a neural signature. But the simplicity cashes out in different ways depending on which kind of experiment we're modeling.
So in the language world, we'll often do naturalistic reading studies where we take a look at how long people spend looking at words during-- while they're reading stories, or newspaper articles, or what have you. And so in that kind of setting, a common simplifying assumption about the response function would be that the properties of the word are the things that influence the response to that word and nothing else. But in reality, it may be the case that the response time at a particular word is influenced not only by that word, but by some aspect of the preceding context that we're not capturing well with our model.
So that's one kind of simplifying assumption in the behavioral domain. For example, in a naturalistic MRI experiment, a simplifying assumption might be that the hemodynamic response that governs the way that blood oxygenation changes in response to neuronal activity-- that that's going to be some fixed shape for all the voxels and all the participants that we are going to be looking at in our experiment.
When in reality, we know that that's not true, that the shape of this function changes between individuals, between regions of the brain, and that it can differ between different kinds of effects as a function of their underlying time course. So things that take longer to process can have a slower looking HRF than things that take less time. So that's one simplifying assumption that we can make about the response function in the MRI domain.
And then, in EEG for example, we might epoch our data, so kind of chunk out segments of the time series that are associated with the onset of some property of the stimulus and then average those in order to get an event-related potential, or ERP. And the assumption there is-- especially in a naturalistic setting-- that those averages are a valid way of finding that event-related response, when in fact that response could be influenced by other events that occurred in the context and that are being included in the average. So different assumptions we make about the nature of time can influence the models of these response functions that we obtain.
But that's just one of the assumptions. That's kind of the primary assumption I'm going to focus on today. But there are a couple of others I think that are relevant, as well. A very common assumption is that the variance of the response is constant. So we're going to have a constant spread of responses around the mean throughout the experiment.
And if that's not true, then that can have a big impact on the results that we obtain. Because we're going to be overly sensitive to points where we've underestimated the variance and excessively insensitive to points where we've overestimated the variance. So changes in the actual distribution of the response and not just its value over time can have an influence on the results that we obtained.
Another assumption we often make is that the response function is stationary-- so that it applies the same way at every time point in the experiment, when in fact, people adapt to experiments. And their responses could be different at the beginning versus the end of the experiment. They could be different as a function of which context they're in when they're responding.
And then finally, we tend to assume linear, mutually-independent influences of the predictors in our model on the response, which may be true in some cases. But it's a strong assumption to make, especially given the growing success of, say, deep neural network models, which are the best existing simulations of many kinds of complex cognitive processes that we currently have in domains like speech perception or visual perception. And one reason why those models appear to be so successful is precisely that they relax this assumption-- that they can flexibly integrate and combine properties of the environment in nonlinear ways.
So again, today we're talking about this approach that I'm calling Continuous-Time Deconvolutional Regression, or CDR, where the primary contribution is to relax assumption one. But then I'm going to talk about some generalizations that can also relax the remaining three assumptions that I'm talking about here, too. But the idea is that we can kind of get past the necessity for some of these simplifying assumptions if we simply allow the response function estimated by the model to have a continuous-time definition.
So that's the assumption we're targeting here with CDR. So to kind of give a basic, conceptual overview of the approach, I'll start with something that's not CDR. So imagine that we have this time series. We've got some stimuli, X, of different values. And they're sequenced in time in this way. And then we measure some responses to those events, Y.
And so we want to use the values of X in order to generate a prediction for the values-- the corresponding values of y. So in this example, we're making the assumption that the value of X determines the corresponding value of Y-- the one that's kind of co-located in time-- and no other values, no other aspects of the stimulus sequence affect Y. So this might be a model that would actually be used, for example, to study reading, where we might assume that how surprising a word is in context-- imagine that that's what X measures-- governs how long a person spends reading a word. That would be what Y measures.
So as I talked about before, that may not be true. We may have these kinds of lingering or residual effects of information processing load from previous words or previous stimuli on the response that we obtain at the magenta time point. And so a common way of relaxing this independence assumption is to just add in regressors for some finite number of preceding words, allowing us to get temporally-weighted effects.
So here, we are predicting Y not just based on its corresponding surprisal, but also on how surprising the preceding word was, as well as the word before that. And so we can get this kind of discrete function that says the effect of surprisal jumps in this particular way from lag to lag. And this kind of model will allow us to estimate that from data.
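To make the discrete-lag idea concrete, a spillover / finite impulse response model with two lags of surprisal can be sketched like this (the notation is mine, not from the slides):

y_t = \beta_0 + \beta_1 x_t + \beta_2 x_{t-1} + \beta_3 x_{t-2} + \varepsilon_t

where x_t is the surprisal of the word at position t, and the fitted weights \beta_1, \beta_2, \beta_3 form the discrete response function over lags 0, 1, and 2.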
And so this approach is called many things depending on your field. So some will call it distributed lag regression. Some will call it spillover. Some will call it finite impulse response modeling. It's the same thing, but the crucial commonality between all of these approaches is that you have this indexical notion of time-- that it kind of progresses as a function of order but not as a function of the actual timestamp of these events, X, and the responses, Y.
So if the events, X, are equally spaced in time, that's no big deal. You can just easily map from the order of an event to its timestamp. But when we're talking about naturalistic events and naturalistic stimuli, we're often dealing with sparse, irregularly-spaced events. And that scenario is where these kinds of discrete time approaches really break down.
So the idea being that imagine, again, that we're modeling reading times here. There may be a difference in the amount of influence that a preceding word has on Y as a function of how long ago that word was fixated. So if it was fixated 100 milliseconds ago, there may be more residual processing going on than if it was fixated 3 seconds ago.
But this approach erases all of that information. We're not looking at the variation and duration of the events, X. We're just looking at when they occurred relative to each other-- their relative order.
So the idea behind CDR is to take that variation in time into account. So instead of just looking at when events in X occurred relative to each other, we actually look at their time stamps, as well as the time stamps with which the responses, Y, were measured. So here the same five events that I've been kind of cartooning before are shown here on the real timeline, along with their associated responses.
And so instead of defining the response function as a vector of discrete weights associated with particular points in time or particular positions in time, we're defining the response function as a continuous function that has some parameters that describe its shape. So that's plotted in green, where IRF is short for Impulse Response Function.
And so what this buys us is the opportunity to consider the joint influence of all the preceding stimuli in the experiment as a function of how long ago they occurred in time. So we can take the response function associated with each of these impulses, each of these events, and then use the response function to compute a convolution weight for that according to how long ago it occurred relative to the time point that we want to predict-- say, this magenta bar, Y-- which means that our prediction for Y is conditional on the entire experiment and not just on the co-located event.
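In equation form, this kind of model can be sketched as follows (again, the notation is mine): the predicted response at measurement time t is

\hat{y}(t) = \beta_0 + \sum_{i:\, \tau_i \le t} g(t - \tau_i;\ \theta)\, x_i

where \tau_i is the timestamp of stimulus event i, x_i is its predictor value, and g(\cdot;\theta) is the parametric IRF (the green curve), whose parameters \theta-- governing both shape and amplitude-- are estimated from the data.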
And what's nice about this definition is it makes no requirements as to how X and Y are aligned or measured. So because of this continuous time definition, we can rejigger where these responses get sampled. And we can apply the exact same model.
So by separating out the ways that events occur and the ways that responses are measured over time, we can get this very flexible mapping between any two synchronized time series. And I'll say at this point that the IRF-- so I'm all about relaxing assumptions here. That's the motivation that I've given. But now, we're assuming a new thing, which is the functional family for the IRF.
So we can assume something that's simple, like, say, a Gaussian-shaped IRF that has a location and a variance. And that would be a strong, simplifying assumption. But the mere fact of being parametric doesn't prevent us from having very expressive assumptions about what the shape of this thing will be. So we can assume a simple kernel. But we can also assume, say, a sum of arbitrarily many Gaussian functions, so it winds up looking like, for example, a kernel density estimator that can have arbitrary shape.
And we actually-- I'll show in a second-- can drop in a neural network here and have it do the same work. So this assumption of an IRF, I think-- or an IRF family-- is not particularly constraining. At least, it's less constraining than some of the assumptions that I'm using this in order to relax.
So to put this picture in a slightly different way, let's imagine this is our naturalistic stimulus with irregularly-spaced events that are going to trigger some response. We want to estimate what that response is. We can take that stimulus, and we can measure multiple different kinds of responses to it and capture their effects using the same model.
So we might take behavioral measures of the reading times associated with, say, if these stimuli are words in an experiment. So they're going to be spaced out in time in the same way, but we still can capture these diffuse effects of words from the preceding context. But we can also run this in an fMRI context, where we have these slow but regularly spaced scans that are taken by the scanner. And then we can map the expected influence of each stimulus on a given scan as a function of their distance and time without needing to make any assumptions, like that the hemodynamic response has a particular shape.
And then EEG or MEG, it's the same thing as MRI. It's just much more densely sampled. So here, we can use CDR in order to estimate event-related fields or potentials associated with particular words in an experiment, but in a data-driven way that doesn't make assumptions about, for example, the validity of averaging multiple epochs together in a setting where the events or the responses to events are going to overlap. So the upshot of all this is that we have a unified statistical framework for analyzing these kinds of time series regardless of the temporal structure of the stimuli and the responses.
So the approach works. I've applied it to reading-- a couple of different types of reading experiments-- as well as fMRI. It consistently outperforms established approaches using linear mixed-effects regression or generalized additive models in predicting unseen responses. So it generalizes better than alternative approaches, which at least suggests that it's tapping into generalizable features of the true response function that are difficult to capture using other means.
So that was all about assumption 1, like relaxing the simplicity of the response function by defining it in continuous time. I'll now briefly go over how we can use the same basic framework and just kind of some light generalizations of it in order to relax some of the other assumptions.
So assumption 2, which is that the variance is constant, can be relaxed by using an approach called distributional regression, in which you just allow all of the parameters of the predictive distribution over responses-- not just the mean-- to depend on properties of the stimuli. So typically, we're estimating the change in the expected value-- the mean-- of a response as a function of the predictors that are thought to influence that response. But we can also look at the influence of those predictors on the spread-- the variance of the response in, say, a Gaussian predictive distribution over reading times, or the BOLD signal, or what have you.
So the same basic framework applies. We just model all the parameters of whatever distributional assumptions we're making about how the response is distributed, which is nice. That can give us useful information that's otherwise difficult to obtain about how, for example, the surprisal of words might affect how much spread you get in the response to those words.
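One way to write the distributional-regression idea down, for a Gaussian predictive distribution (a sketch only; the exact link functions and parameterization in the actual implementation may differ):

y(t) \sim \mathcal{N}\big(\mu(t),\ \sigma(t)^2\big), \quad \mu(t) = \sum_{i:\,\tau_i \le t} g_\mu(t - \tau_i)\, x_i, \quad \sigma(t) = f\Big(\sum_{i:\,\tau_i \le t} g_\sigma(t - \tau_i)\, x_i\Big)

so the predictors are convolved with one set of IRFs for the mean and another for the spread, with f some positivity-enforcing link (e.g., softplus).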
And then assumptions 3 and 4, which are, respectively, that the response function is stationary and that the effects of predictors are linear and mutually independent can be relaxed by dropping in a particular kind of deep neural network as the IRF that I showed in the earlier plot. And that network relaxes both of these assumptions by allowing predictors to interact with each other in arbitrary ways, to have nonlinear effects on the response, and to depend not only on the values of the predictors in the events, but also on their timestamps, and therefore change over time, as well as change as a function of the preceding context.
So I'm skipping how the network does that for the sake of time, but I mean the upshot that I'm trying to get across is that this gives us a pretty flexible model of time series-- that we can relax a lot of the assumptions that we've previously had to make while still getting an interpretable description of the system that we're trying to model. So those are the main, I think, advantages of the approach.
So in a nutshell, what CDR does is, given a definition of the model-- that is, which predictors we're going to use to predict the response, which responses we want to model, and what kinds of assumptions we're making about the functional family that we want to estimate-- we're just going to use stochastic gradient descent in order to optimize a regularized likelihood function. And that'll give us the best-fitting model that we can obtain using current state of the art deep learning approaches or machine learning approaches to get the parameters of the response function that we care about.
And then, we can inspect those estimates and see what they show us about the response. And then, we can do statistical testing on them by evaluating them with respect to some null model of interest in order to see whether or not we get improvements from some aspect of the response function that we care about theoretically.
So that was a brief overview. At this point, I'll pause for questions.
AUDIENCE: Do you have to use some of your data and put it aside to first come up with this estimate of your IRF and [INAUDIBLE]?
CORY SHAIN: Yes, you should. Yeah, so we do assume relatively big data that we're going to try to partition into training and testing sets. Cool. Well, that's a good segue into the caveats, so that's what I'll start with next before everybody gets too excited about this approach.
Yeah, it's a data-hungry approach compared to, say, linear regression. Exactly how data-hungry is an open question that I'm trying to figure out. How little data can we get away with in synthetic experiments and still get a pretty good estimate of the response? So hopefully in a few weeks, actually, I'll have some answers for anyone who's interested on that. But it definitely requires more data than more basic approaches that make stronger assumptions.
And as a result, it's compute-heavy. It can take-- I mean, a model that might take seconds in a linear regression context could take hours or even days to train here. So that has implications for the kinds of response variables that you can model. So it's pretty common, for example, in MRI to do voxel-wise modeling using linear regression, in which case you have thousands of distinct models that track the time courses of each individual voxel in the brain.
That kind of approach is a lot harder to do here just because of the amount of computation necessary in order to estimate it. But I will say, if you use the neural net version, the system can kind of take advantage of mutual information between voxels in order to share weights and accelerate some of that training.
So you actually can fit these in a multivariate setting where you're simultaneously modeling all voxels. And that can-- although that's still certainly more computationally intensive than linear regression voxel-wise modeling, it's feasible, at least for moderate data sets. But yeah, it takes more computing.
The likelihood function is non-convex. So there's no way of guaranteeing-- this is the case with any deep-learning model-- that you have found the best fitting model to the training set. And because the model is so powerful and flexible, that isn't probably the thing that you want to do, anyway. What you want is the system that will generalize the best.
So for that reason, fit to the training set itself-- so how well models fit the training set-- is not a particularly informative metric. Because it could reflect overfitting if you don't have any way of diagnosing whether your model's effects generalize. So for that reason-- I guess I left this for last-- it is a good idea to hold out some data so you can use it for evaluation and statistical testing.
And then the point that I skipped over here-- uncertainty intervals in the model are approximate. Using different approaches from variational Bayesian deep learning, we can get some idea of how certain the model is about its parameters. But the posteriors aren't going to be exact, and so they shouldn't be used for testing.
Yeah, those are the major ones I've run into. Yeah, so if you hold out data, or if you do another experiment, then there is a gold standard. You have the set of stimulus events associated with data that your model wasn't trained on, and you also have the measured responses. And so you generate predictions using those stimuli for the measured responses, and then you compare how good the predictions are. Does that answer your question?
AUDIENCE: [INAUDIBLE] know it was the real IRF?
CORY SHAIN: Yeah, so the question was, how do we evaluate the model's estimates against a gold standard? And so that we can't do, that's true. Because we don't know what the true response function is. That's the case for any statistical model of an unknown response function like the brain.
So no matter whether we're using CDR or something else, we don't know what the true response is. We're trying to make inferences about what that response is likely to be. And a good way of doing that is looking at model fit-- so not comparing the IRF to some gold-standard IRF, but instead comparing the predictions derived using the IRF to measured responses from an experiment. So that's the approach that I'm advocating.
In synthetic data, we do know the gold response, so we can assume a particular response function, generate data using it, and then fit the model and see whether it recovers the true synthetic model. And there, the answer is yes. I don't have demonstrations of that here.
But we're going to fit-- those of you who have computers will fit to a synthetic data set, and you should see the true model emerging in the estimates. But yeah, as soon as we apply it to things that we don't already know about and are therefore scientifically interesting, then we have to use proxies like model fit in order to select models.
AUDIENCE: --limit to how sparse the sampling frequency can be relative to the frequency of the actual response?
CORY SHAIN: Yeah, there is. That's a good question.
AUDIENCE: Could you repeat?
CORY SHAIN: Sorry. The question is, is there a limit to-- correct me if I'm wrong-- but is there a limit to how sparse the response sampling frequency can be relative to the predictors-- so the stimulus? So if you get a lot more stimulus events per unit time than you get responses, does that limit the inferences that you can make about the response function?
AUDIENCE: [INAUDIBLE] how you record the responses. Let's say, fMRI has a slower or longer interval for each response. Can this still work?
CORY SHAIN: It does. But the sampling frequency of the response, I think, independently of the sampling frequency or the frequency of events will constrain the temporal resolution of the estimates that you can get. Yeah. So there's no miracle here that can inject greater temporal resolution into fMRI than it inherently has. So you can't go beyond its Nyquist frequency in terms of the amount of resolution that you can get about responses.
So even with fMRI, you're generally-- the response function is going to be dominated by hemodynamics. Differences in cognitive time-- underlying time courses at the neuronal level-- won't really emerge clearly there unless they're very large, large enough to shift the hemodynamic response in a measurable way. But a few dozen or a few hundred milliseconds here or there probably won't emerge in even these kinds of models of fMRI, just because the signal doesn't support their discovery. Does that answer your question? OK.
AUDIENCE: The model fits one IRF for all types of stimuli, right?
CORY SHAIN: No, not necessarily. So you get to decide that as the user-- how you want to tie weights in the model. Yeah. Tamar?
AUDIENCE: [INAUDIBLE] tried testing a sample fMRI data set with a long interval, you can see whether it recovers the hemodynamic BOLD signal, an assumed function, if it's similar?
CORY SHAIN: No. I've only applied it to naturalistic fMRI, but it recovers something that looks like the canonical HRF there. And then--
AUDIENCE: [INAUDIBLE] method leads to the same kind of assumptions that classical fMRI do?
CORY SHAIN: Yeah, I mean, I think it does. Because with a neural net-- so a relatively assumption-free model-- you get the HRF from naturalistic data even without spacing things out. So even with compressed and overlapping responses, the same basic response profile emerges. But then, yeah, I've tested in synthetic settings with a bunch of different configurations, and it consistently finds a good approximation of the true model there. So I think between those two findings--
AUDIENCE: [INAUDIBLE] the assumption is the correct one in the fMRI, right?
CORY SHAIN: Yeah. Yeah, so I think so the result is that the resulting estimates from applying it to naturalistic data do look like the HRF. But they still fit better than assuming the canonical HRF. Mm-hmm.
AUDIENCE: A question from online-- how can we use CDR for analyzing naturalistic movie fMRI data where stimulus events are not explicitly defined?
CORY SHAIN: Yeah, so it's a great question. And I don't have to repeat it, because everybody here heard it? OK. So you need some sort of representation of events in order to do this kind of modeling. So you need like two fixed points and one unknown.
So in this case, the assumption is that the fixed points are the measured responses and some representation of the event sequence. And then the question is, what function mediates between those two things? And we're going to use the deep learning or machine learning in order to get an estimate of that.
If you know neither the event sequence nor the response function, then you don't have enough information to fit this model. I think that in practice, it's often possible to come up with some sort of coding of movies or other naturalistic stimuli that could allow you to at least fit responses to gross, easily discoverable events in the sequence. But yeah, without that, there's not much that can be done here, because the response function is, by definition, a response to events. So it needs to know what's going on underlyingly.
Let's move on to pitfalls, which I'm distinguishing from caveats as, these are things that are easy to forget and mess up if you're used to a linear-- or really any kind of existing analysis pipeline. And this is a combination of my own experiences as well as experiences of others who have tried to pick this up, where I realized, oh, wow, that wasn't clear. So one thing-- in some domains this is kind of obvious; in others, it's less so-- is that the predictors and responses are different things. They're distinct time series in this approach.
So if you're doing neuroimaging, this is an obvious thing. Your predictors are going to be whatever-- however your events in the experiment are coded. And your responses are going to be fMRI or EEG measures.
But when you're using-- I think this becomes an issue in behavioral data when you're, for example, modeling like reading times, where there may be just one data set that represents the time series. And it includes a column for your response variable, but also columns for your predictive variables. And so it's important to remember that the two quantities are actually represented differently by the model.
And this has implications for this point-- I did this in the wrong order-- which is not dropping predictors. So it's common to filter your data according to some sort of outlier detection algorithms. And that's valid in this approach if you are applying it to the responses. So I'm not going to model the response in this chunk of the time series, because I don't think that it's a good chunk according to whatever criterion I've defined.
But you shouldn't do that to the predictor sequence, because then you're kind of telling the model that whatever happened here didn't happen. Those are the events that the responses are supposed to be associated with. So for example, if you're running this on a reading experiment and you're like, I don't want-- I want to drop all fixations that are longer than such and such.
If you drop that from the predictors, then you're going to wind up distorting your model. So just keep in mind that the predictors and the responses are different things and that any data censoring or partitioning-- also partitioning data into training and test sets and so on, that should only apply to the response, not to the predictors.
A second intuitively obvious thing that's easy to run afoul of is that everything has to have a timestamp. A lot of times, people who collect data, especially open data, haven't done this, because they weren't planning on analyzing things in this way. But using a continuous-time model, you need to know when things happened on the real timeline.
And if you can't either get or reconstruct that information, then you can't use this approach, unfortunately. So that can kind of-- I've run into that. It's constrained my ability to use certain public data sets that don't have that information.
You need to think about what constitutes a run. Basically, where am I going to chunk up the time series such that parts of the time series are treated as interdependent versus independent? So for example, in fMRI, this would probably just correspond to an fMRI run. So I'm going to have a time series associated with each distinct time a participant sat down in the scanner and stayed still and got scanned.
But yeah, so you got to keep in mind where the events are going to-- or where the time series themselves are going to be chunked up. Because in a typical experiment, your table will represent multiple concatenated time series.
And you want to avoid either spilling over into the wrong time series things that aren't actually consecutive in time or chunking things up too much, where things that really are underlyingly part of the same experimental setup and time series are treated as independent when they're not-- when they may not be, in fact. And then, as we've talked about, it's important to hold out some data for evaluation.
Oh, yeah, choosing a reasonable time scale-- this one I've run into recently. So a lot of times-- well, sometimes experimental software will time stamp events as the number of milliseconds elapsed since January 1, 1970, or whatever default representation of time, which results in huge numbers. And because this approach just assumes 32-bit floating point underlyingly, you can actually wind up with really weird degeneracies that you don't expect.
So just in general, try to avoid huge numbers in your data frames. This often isn't an issue if you're just doing linear regressions in R, but it can be here. So I recommend starting your time series timestamps with zero indicating the beginning of the run, rather than the beginning corresponding to, for example, midnight on January 1, 1970. So that's the end of my overview. Yeah?
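A minimal sketch of that rebasing step in pandas, assuming a predictor table with the lowercase time column the walkthrough requires later and a run column identifying each time series (the file name, separator, and run column name here are illustrative):

import pandas as pd

# Load the predictor table (space-delimited in the sample data; adjust sep to your format)
df = pd.read_csv("X.evmeasures", sep=" ")

# Rebase timestamps so each run starts at 0 rather than at a huge absolute clock value
df["time"] = df["time"] - df.groupby("run")["time"].transform("min")

df.to_csv("X_rebased.evmeasures", sep=" ", index=False)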
AUDIENCE: In terms of the held out data, does it have to be cross-validated or is it always like you leave something out and then that's always held out? And then second thing, what about in terms of circularity if you do some downstream, cross-validation analysis for a different purpose?
CORY SHAIN: Yeah, that's a great question. I mean, I don't think there's an inherent answer. You can do whatever makes the most sense for your needs as an experimenter.
The challenge with cross-validation is that you can't use it for both tuning and testing. So one advantage and one reason why a lot of machine learning applications will have a fixed tuning and test set is so that they can try a bunch of different variants of their model on the tuning set but then keep the test set unsullied.
So if you're not sure about your model definition or you're worried about the influence of some of the parameter settings for fitting and you want to be able to optimize those before you do any evaluations, then I would not do cross-validation. But if you're 100% sure-- you've got everything fixed in your mind, and you've pre-registered these are the analyses that we'll do-- then cross-validation makes more efficient use of your data.
AUDIENCE: But what about the downstream cross-validation? Say that I modeled my data, and then I want to do something else afterwards. Is that then weird that you use cross-validation to estimate the parameters and then you--
CORY SHAIN: It can be. I think it depends on how different the something else is from what you already did. I think it's difficult to answer in the abstract. But there's definitely more risk there than if you've got a special set, yeah.
AUDIENCE: I have a somewhat similar question. Out of curiosity, have you looked at, for example, let's say you're trying to model fMRI data. Is there decent across-data-set generalization? So for example, let's say that you have two different, continuous naturalistic time series of words or something.
You extract the same features from both. You train on one data set and test on another one collected, for example, on a different day, but same person, same coverage. Do you have any data on how that looks?
CORY SHAIN: No, I don't. That's a great question.
AUDIENCE: Can you do a short summary?
CORY SHAIN: Yeah, so short summary is, have I evaluated how well model estimates generalize to new data sets from the same domain? Is that a fair summary? And the answer is, not yet. But I've got something in the pipeline for that. So hopefully I'll know more soon.
AUDIENCE: [INAUDIBLE] personalized HRFs, if that would improve upon just assuming the canonical HRF scenario.
CORY SHAIN: Yeah. Well, for that question-- sorry. The question is whether learning a personalized HRF is better at describing the data than assuming a fixed, canonical HRF shape. And for that, I've shown that it is, for at least one data set. But the question about whether the HRF estimated on one data set works on a different data set that it should work for is a good one. And I haven't done that yet.
If you haven't already, go ahead and install the software. And I actually think this slide is worse than just going to the tutorial page that was linked to in the email. So if you pull up the Google Doc-- please, somebody, let me know if there's an issue accessing it-- there's really two ways that you can install-- I'm assuming that you all have Python installed, or Anaconda, or something of that nature.
So if you're an Anaconda user and you want to put this in an environment, there are instructions for doing so in the README. And if you just want to use standard Python, you can do it with pip using this command. So again, this is to install an implementation of the framework that I've been describing.
And this Doc also has a link to sample data and models. So there, there's going to be two files-- two ZIP files that you want to download. One is called data.zip and one called models.zip. So you want to download those and extract them into a working directory that you want to use for this. It doesn't really matter where.
And then the next thing is showing how to configure your experiment so that it can run. So yeah, so somewhere in your working directory, you should have a models directory and a data directory. Depending on how your extractor worked, there may be an additional, nested level here, which we don't want.
So if you click on Data and it just has another directory called Data, then you want to remove that. So move the lower directory into the higher one. Same with models.
And so this is what your models directory should look like. ET stands for eye tracking. fMRI stands for fMRI, and SPR stands for self-paced reading.
AUDIENCE: Cory, are there words that have meta-level [INAUDIBLE] eye tracking like [INAUDIBLE]?
CORY SHAIN: Yeah, anything that you've got in your table can be used as a regressor. So CDR demo is going to be my working directory. So assuming that CDR is installed, you can then run this command-- python -m cdr.bin.create_config.
So the idea behind the implementation is that I've tried to automate everything as much as possible so that you can just run stuff using simple command-line executables once you have a definition, a configuration file that describes the data that you're going to be using for fitting as well as the kind of model that you want to fit. And then, anything in this cdr.bin directory has a help that you can run by just putting -h. So if I run this -h, then it gives you usage instructions for the script.
If I run create_config -a, then it's just going to spit out a bunch of-- like a template for a configuration file for an experiment that I can then pipe into a file that I want to save it to. So I can call this-- here this is going to be a synthetic data set, just because it fits quickly. So I'm going to call it synth.ini.
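Written out, the sequence just described looks roughly like this (the module name spelling with the underscore is my reconstruction of what was said; the -h output is the authority on the exact flags):

python -m cdr.bin.create_config -h             # print usage instructions for the script
python -m cdr.bin.create_config -a > synth.ini # dump a full config template and save it to synth.ini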
So this is creating a new configuration file template for me to use at the path synth.ini. So if I open up that file, then I've got this kind of verbose description of all the different pieces of the config that I can use to set up my experiment. But there's really three-- or really two critical sections of the config.
The first is a data section that specifies where the data is that the model will load. And then the second is a model definition.
So I'm going to go through the parts of this. Is everybody OK to move on? Does anybody need help with installation or data downloading? All right.
So in the data section, I minimally need a path to-- X stands for the predictors, so the stimuli.
AUDIENCE: [INAUDIBLE].
CORY SHAIN: Oh, yeah. I hope. Yeah. That better? Yeah. All right, this is going to be a path to stimuli that I'm going to use in my experiment. So this would be, for example, I keep coming back to the reading example. But this might be the list of properties of the words as they were fixated in a reading experiment in order, for example.
And then Y is going to refer to the path to the data that contains the responses that I want to model. So my experimental measure-- BOLD responses, EEG readings, what have you.
These are assumed to be in a tabular format. So if you don't have that-- a lot of neuroimaging pre-processing pipelines don't spit out CSV files-- then it may require some kind of intervention in your pipeline in order to get a CSV that represents the data that you want to model. But this is kind of a standard assumption for these kinds of regression tools. Like in R, for example, you just have a table that represents the data that you want to use for predictions and responses.
And then you can specify some number of columns that indicate-- that the model will use to partition up the time series. So for example, if I have a separate time series for each participant in the experiment, and I have a column called participant, I might just put that in there. If I wanted it to be-- if I had multiple stories in my model or my data and I wanted a separate time series for each story, for each participant, then I could do participant story with a space in between them to say I'm going to use both these columns to chunk up the time series.
So whatever I give it, each unique value of the combination of these columns is going to-- the model will treat it as a distinct time series. And everything else will be independent. So I'll delete this, because we actually don't need this in the synthetic example. And then for X train, the path that we're going to use is the X.evmeasures file in the data directory-- what did I call it? X.evmeasures. And then, Y train is going to be the Y.evmeasures file.
So if you look at these files-- if you just open them in a text editor-- it's a space-delimited tabular file that says what-- and this is all synthesized, so these are not real, this is just generated programmatically-- but what the values are for these, in this case, 20 different predictors labeled A through T. And so these are going to be kind of simulated events in my experiment along with the timestamps, for example, associated with them.
And then, if I look at the Y measures, Y.evmeasures, it's the same. Only here, I have this distinguished variable, Y, that I'm going to be using as my response variable, as well as the time that Y was sampled. The other predictors here, you can just ignore.
Yeah, so I've defined my paths to data. And then here, some things to keep in mind. So for example, if I want to filter my data in any way, like remove outliers, I can specify filters here. I'm not going to do any filtering, so I can delete that.
And then here, the separator in these tables is a space. But if it's a comma, for example, I could do this. So just know how your data are formatted.
And then, output directory can be whatever you want. You just write a path to wherever you want the model to dump all of its outputs. So here we can just call it synth. So it'll create a new directory called synth, and it'll put everything in there.
Oh, and then this optional part that I skipped over by accident-- you can also specify dev and test partitions of your data. Here we will actually not worry about that either. So you can either leave it there or delete it.
AUDIENCE: [INAUDIBLE]
CORY SHAIN: Yeah, so history length is the number of preceding events that will be considered. So although in theory, our convolution can operate over the entire preceding time series, it's pretty computationally inefficient to actually implement it that way. And so we cut it off at some time point. So here, it's no more than 128 events into the past.
AUDIENCE: Quick question from online. What is the role of history length, here?
CORY SHAIN: Yeah, sorry, that's the one I was just answering. No worries. Yeah, so it's just a truncation of the history. So you should pick a generous value beyond which you don't think there will be any influences. All right.
AUDIENCE: One more quick one, Cory. What would the format be for the dev and test set tables if we were using them?
CORY SHAIN: Yeah, it would be the same. Yeah. The config assumes that your data are formatted in the same way. So you would want like a tabular representation with all of the predictors that you have used in your model in the X dev or X test paths and then all of the responses that you've modeled in the Y dev and Y test paths. If you're missing anything, the model will just complain that you don't have it, and then you can figure it out.
And then, as I said, everything needs to be time stamped. So there's the distinguished variable called lowercase time that's expected in both the predictors and the response files. So you'll get a complaint if your data don't have that.
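Pulling those pieces together, the [data] section for the synthetic example ends up looking roughly like this (a sketch only; the key names follow the generated template as best I can reconstruct them, and the paths assume the extracted sample data sits in your working directory-- defer to your own synth.ini where they differ):

[data]
# Path to the predictor (stimulus) table
X_train = data/X.evmeasures
# Path to the response table
Y_train = data/Y.evmeasures
# Columns that chunk the time series, e.g. "participant story"; not needed for the synthetic data
# series_ids = participant story
# Column separator used in the tables (space-delimited here)
sep = ' '
# Directory where the model writes all of its outputs
outdir = synth
# Number of preceding events considered by the convolution
history_length = 128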
So the fun begins here where you define the model. So here, you tell it that it's a model section by using this prefix, model underscore. You tell it that it's a CDR model by using CDR underscore-- or not even underscore. So minimally, if you want to define a CDR model, the section header needs to be this.
And then, you can define as many models as you want of the same data, but they have to have unique names. So I could have model 1-- or 1-- can't type-- and then another one down here that's 2, and so on. No constraints. You just need to have distinct names for them. So pick names that you'll be able to interpret later.
So here we can just call it CDR, because we're just going to fit one. And then the formula-- it's inspired by linear modeling formulas from R. So it has a left-hand side and a right-hand side, where the left-hand side is the dependent variable, then you have a tilde delimiter, and then the right-hand side is the predictors that you're going to be using for the modeling.
So in this case, if you go back to the CDR tutorial, I've put a section called synth model formula. You can just copy that and paste that into the formula equals section, here. And I'm jumping ahead to this so that we can get models running for anybody who wants to on your computers. And then I'm going to parse this for everyone.
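Schematically, the model section then looks something like this (the formula shown is an illustrative stand-in for the actual synth model formula in the doc; the predictor names a and b and the IRF family spelling are hypothetical):

[model_CDR]
# Left of the tilde: the response variable; right: predictors wrapped in C(..., <IRF family>)
formula = y ~ C(a + b, HRF())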
So once we've done that, assuming everything is configured correctly, we can go to our command line and type python -m cdr.bin.train. And then we need the path to the config file, so synth.ini. And then you can do this little -m argument to tell it which model you want. So here, we've called our model CDR, so that's what we'll do.
If you don't tell it the model, it'll just run everything-- all the models in the config file in sequence. Which, since we only have one, is a viable option. But if you want to specify which model you want to run, this is how. All right.
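So the training invocation, written out (as dictated above; -m selects a model by the name given in its config section):

python -m cdr.bin.train synth.ini -m CDR   # fit only the model named CDR
python -m cdr.bin.train synth.ini          # or omit -m to fit every model in the config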
Seems like it's running. Good. All right, so it's just going to go through-- an iteration represents a single pass through the entire data set. So it's going to run a bunch of these until it reaches its convergence criteria. And then it will stop.
With this one, I expect this will-- I don't know-- probably take several hundred iterations. But if you've got a reasonably fast computer, maybe it'll be done by the time that the session is over. But it spits out some useful stuff along the way, including what its loss is-- so the loss is going to be roughly proportional to the negative log likelihood of the model-- along with some regularization terms if you have any, including, in this case-- it's by default a variational Bayesian model, so it also has a KL divergence regularizer.
And then, if you want to see the model's learning curves, which can be useful, you can just run a utility that's provided by the underlying library I'm using here, called TensorFlow. So python -m tensorboard.main. And then, logdir equals-- and then you want to use the path to your model. So in this case, it's going to be synth/CDR-- I can't type.
So this is where the model is saving. So we set already that our output directory is synth, and then each unique model name gets its own subdirectory there. So if I do that, then I should be able to pull up TensorBoard.
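Written out, that TensorBoard invocation is roughly (the log directory is the per-model output subdirectory: the outdir from the config plus the model name):

python -m tensorboard.main --logdir=synth/CDR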
Oh, I may have another instance of it running already. I do. That's the problem. There we go.
So this gives me some idea of the trajectory of some of the parameter estimates in the model, which I'm not going to interpret for now. It gives me convergence criteria. So basically, as all these lines go up, I'm getting closer to converged. And then, the learning curves are going to be here. So this loss by ITER is going to tell me how my losses are changing with each iteration.
So yeah, so as that model runs, it will eventually spit out some logs. So I think it's going to probably do it every 100 iterations. So in a few seconds, we'll get some information about the model's fit.
And it's going to get-- and one of those logs that's of interest is going to plot the estimated impulse responses at this point in training. So you get those incrementally, and you can kind of check. Because some of these models are long-running, and you don't have to wait until the very end to see if there's something weird going on.
But meanwhile, I thought it would be useful to actually look at some fitted models. And that's where I gave you the models file, as well. So while your synth model is training, you can go ahead and do some-- we can do some other stuff with the previously-fitted models.
So if we look at the models directory, again, there's going to be three different data sets-- or models fitted to different data sets. And I'll just show the fMRI data set, since I think it's probably going to be familiar to the largest number of people here.
So if you look at the contents of the fMRI data set, there's one model there. It's called CDR.b1024. And it also has a config file. So if you want to see how that config file is structured, you can look at that. But it's the same basic setup as the one we looked at. It's just got some extra hyperparameters specified for the model.
So if we want to plot what the model has estimated, then we can do cdr.bin.plot and then give it the path to the fMRI model. So it's going to be models, fMRI, CDR. If we run that, it will produce impulse response plots. Oh, CDR underscore b1024. So sorry. That's the name of the model.
No, I'm losing my mind. It's the config. We give it a pointer to the config file, and then we tell it which model we want to run, which is CDR underscore b1024. So to review, to plot, we're going to run CDR.bin.plot. We give it the path to the config that defines the model, and then we tell it, using the -m argument, which model we want to run, which is named CDR b1024.
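To recap that plotting sequence (the config filename placeholder stands in for whatever the .ini file inside models/fmri is actually called):

python -m cdr.bin.plot models/fmri/<config>.ini -m CDR_b1024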
And then, if we go to that location, then eventually, once it's done plotting, there'll be some PNGs that get spit out here that will tell us what the model's estimates look like for this domain. All right, so I'll blow this up.
So these are impulse response functions to different variables in a naturalistic fMRI data set where people listened to stories. So basically, what you're seeing is the change in BOLD response-- on the y-axis-- associated with some delay after observing a word-- so after the onset of a word in the auditory stimulus-- on the x-axis. So 0 represents no delay-- the instant of word onset. 5 represents 5 seconds after word onset. 10 is 10 seconds after word onset, and so on.
So as you can see, this is roughly the same shape as the HRF that I kind of cartooned for you during the presentation where you get little response immediately, but then you get this kind of peaking primary response that happens around 5 seconds after the word is presented, which then dips. And then, in the case of several variables at least, becomes a negative undershoot. So we actually get this kind of-- we've gone too far in delivering blood to this region, so we're actually going to dip.
And then you get the response broken down by which predictor we're looking at. So we get a distinct predictor for rate. Rate is essentially an intercept term for this model which represents the generic response to an event regardless of its properties. So if a word happens, how will blood oxygenation in language-selective areas change, regardless of what that word is in the average case?
And then, these other responses reflect deviation from that. So how much more or less will the response change if that word is more surprising than the average or if that word has a higher like acoustic loudness in the auditory stimulus?
So each of these predictors actually has its individual set of parameters that are going to describe not only the shape, but also the amplitude of its response. So I can get a differential sensitivity over time and different shaped responses to different kinds of predictors.
Now, it happens to be the case in this domain that, again, as we said earlier, the response is dominated by these hemodynamics, which are largely shared across the different predictors. So we get very similar looking responses that vary only mostly in their amplitude. That won't be the case with other kinds of things. Uh-huh.
AUDIENCE: [INAUDIBLE] shared, I wonder how you would differentiate whether a response has to do with, let's say, how loud it is versus how surprising it is. Because a surprising word can also be loud and vice versa.
CORY SHAIN: Right. And so the assumption-- so one assumption you can make is that these are going to be these effects are going to be additive, so that whatever contribution-- surprisal on the one hand and loudness on the other-- make, they get summed together. So they're in kind of a zero sum with each other in the model. That would be a standard assumption from linear modeling, as well. Or any kind of hemodynamic response modeling will typically make that assumption.
The model I'm showing here actually doesn't make that assumption, because it's fitted with a neural net that can allow saturating effects of both things-- so, nonlinear effects. And in this context, if we want to test whether these things make separate contributions, then we might want to fit a null model that constrains the two to be additive with each other, for example. In this case, though, we're just visualizing changes over time.
But we can also visualize changes along the predictor dimension as well. So here, we're looking at the temporal dimension. But in cases where we fit a nonlinear response function, we can also look at what that nonlinearity looks like in the surprisal domain, for example, or in the sound-power domain. So however we choose to define the model is going to determine the flexibility with which the model can adapt to these different aspects of the response function.
So I think it would be confusing to go there now, because it's jumping too much too fast. So for now, I'll instead show how we might define the model in order to enforce or relax different kinds of assumptions about the way it works. So I'll return to our synth model here.
And I'll first show you what a typical, I guess, linear model might look like-- so not one with any convolutional structure. So we'll call this LM. If we just delete some of these terms, then we just have this 20-predictor linear model of our synthetic data.
So here, we're just asking, what are the contributions of all these predictors, separately, to y? This is a valid model in CDR. CDR will assume that anything that doesn't have a specified response function uses what's called a Dirac delta response function, which is basically a response function that's 1 at the moment of the event itself and 0 elsewhere. So we're going to discard the contributions of all surrounding context, except for stuff that occurs at the same time that we're measuring the response. That reduces to a linear model of the predictors on the response.
This isn't a good way of fitting such a model-- you can do so a lot more efficiently in R-- so it's not a recommendation. But just to get a sense for how this works, these kinds of linear effects are possible to fit.
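As a rough sketch, assuming a response column named y and predictor columns named a and b (stand-ins for whatever your data actually contain), the formula for this kind of purely linear model would just look like an ordinary regression formula:

    y ~ a + b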
If we want to add in convolutional structure-- sorry, I should give this a different name. CDR_conv-- then we need this C term. C is a two-argument function in the syntax of these formulas, where the first argument is the set of predictors and the second is the response family.
AUDIENCE: [INAUDIBLE] in the EKG, are those predictors?
CORY SHAIN: Yeah. Yeah, so the right-hand side is going to describe all the predictors. And in this context, we additionally specify not only which predictors we want to attend to, but also what kinds of functions of those predictors we want the model to be able to fit in describing the response.
So for example, if I do this, then I've got these 21 different predictors, and I have a response function that's a neural network. So: take a neural net and fit the response to these predictors.
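In formula terms, that's something along these lines, again with y, a, and b as placeholder column names. NN here names the neural-network response family described in the talk; the exact spelling should be checked against the library documentation.

    y ~ C(a + b, NN())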
I can constrain it to be something more specific, if I have some reason to do so. So for example, I might want it to be in the family of double gamma hemodynamic response functions. So I can replace it with HRF if I know that I'm fitting fMRI data, and I want to take advantage of prior knowledge about that domain.
Or in the case of reading, I might assume that we've got a response that's maximal at the current word and decays over time, and it's just a question of how steep that decay is. So I might fit a one-parameter exponential response function.
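As sketches, the fMRI and reading cases might look roughly like the two formulas below, with HRF standing for the double-gamma hemodynamic family and Exp for the one-parameter exponential family described here. Treat these family names as assumptions to verify against the documentation.

    y ~ C(a + b, HRF())
    y ~ C(a + b, Exp())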
The software implements several response functions. And if you're interested in using one that's not available, I'm happy to try to help you implement it. But yeah, the most flexible by far would be NN. Neural nets are universal function approximators, so you should probably be able to get something close to what you need using this.
So I can actually specify different response functions for different predictors. So I might say, I want these predictors to have an exponential shape. But these, I don't know about. So I'm going to give them a neural net shape or something like that.
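A sketch of that, with the same placeholder names: give predictor a the exponential shape and predictor b the neural-net shape by using two separate convolution calls.

    y ~ C(a, Exp()) + C(b, NN())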
I can also specify fixed values for the response-function parameters, if I don't want them to be estimated. So in this case, here are the parameter values that are assumed by, say, SPM for fMRI modeling-- a double-gamma hemodynamic response that has a five-parameter shape with these fixed values.
If I specify those values in my model, then this is equivalent to just pre-convolving with the canonical HRF and then fitting a linear model to that. So again, not something you would actually want to do, because this would be a way more computationally intensive way of doing it-- just a demonstration of the fact that we can fit familiar kinds of models within this space. But then we can also relax them, just by deleting, for example, this last parameter, in which case c, which represents the ratio between the overshoot-- the primary response-- and the undershoot component, would be fitted by the model rather than specified in advance.
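As a sketch of that idea, with hypothetical parameter names and the familiar SPM-style defaults filled in (the actual spellings and values should be taken from the talk's config and the documentation, not from here): the first formula fixes all five parameters, reproducing the canonical HRF, and the second drops the ratio term so the model fits the overshoot-to-undershoot ratio itself.

    y ~ C(a + b, HRF(alpha_1=6, beta_1=1, alpha_2=16, beta_2=1, c=0.167))
    y ~ C(a + b, HRF(alpha_1=6, beta_1=1, alpha_2=16, beta_2=1))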
There are quite a few other things that can be done. Since Ethan asked about it, I'll also mention that we can tie parameters by specifying a name for our impulse response function. So if we say irf_id-- that's the keyword argument for tying parameters-- equals something. Let's call it, I don't know, singleHRF.
Then I'm going to fit one HRF-- so one five-parameter response function-- that will be applied to all of my predictors in the same way. And then my predictors will only be able to vary in the amplitude of their response to that function. So this is a useful thing for fMRI where, again, in most cases it's a reasonable assumption that there's just one response shape in a given brain region whose amplitude is modulated by your experiment.
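One way this could look as a sketch, with placeholder names; the exact keyword spelling (written here as irf_id) and whether the shared name needs quotes are assumptions to verify against the documentation. With the shared ID, the predictors in the call use one set of HRF parameters and differ only in their amplitude coefficients.

    y ~ C(a + b, HRF(irf_id=singleHRF))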
And then, the other thing that I think is going to be widely used would be mixed models. So here, I've only specified everything at the population level. So I'm assuming that the effects are going to be the same for all entities in my experiment. But that may not necessarily be true.
So the traditional syntax for mixed models in R would be something like a + b + (a + b | ranef). So if you've ever fit models like this before, this means I have population slopes for variables a and b, as well as random deviations of both of those slopes for each level of some random grouping factor. So this may be participant, for example. So I'm going to allow each participant's slopes for a and b to differ.
Only here it doesn't just have to apply to slopes or magnitudes. It can also apply to the parameters of the responses-- participants may have differently shaped responses. For that, we just need to put the whole convolution call in the random effects component.
So here, we're fitting a neural net IRF of a and b. And here, we're allowing the parameters of that neural net to differ for each participant in the model. So this falls back to the population-level estimate if you predict on data from participants who are unseen in training. So you should be able to use it kind of gracefully on any kind of prediction problem, even for new data.
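Putting that together as a sketch, with participant as a placeholder for whatever the grouping column is called in your data: the first term gives the population-level response, and the parenthesized term allows by-participant deviations in the neural-net IRF itself.

    y ~ C(a + b, NN()) + (C(a + b, NN()) | participant)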
I guess I'll open it up for any practical questions. Did anybody's fit on the synth data finish? I don't think mine did, but I can show you where things stand with it. Yeah, so this is roughly what the estimated response function looks like on the data that my computer has been fitting so far.
And then, if I pull up the true response-- nope, not there. Here-- you can see that it's relatively similar. The way that it's plotting with this legend is causing it to squish the y-axis, so it looks less steep. But it's getting pretty close to finding what it should.
Right, in a completely simulated data set.
AUDIENCE: Yeah, and [AUDIO OUT] of some of these start at 0, which those are just-- I mean, real data, we expect to have some [INAUDIBLE].
CORY SHAIN: Yeah, it's a good question. And yeah, certainly that wouldn't be a good model of, say, ERP data. For other kinds of things, like in the reading domain, a nonzero response at time 0 does make sense, because a word can have some immediate influence from its own properties, or at least it could in principle. Yeah. Right.
If you know that there should be no response at time 0, then you can use a kernel that will disallow that-- for example, a gamma kernel. But yeah, this is just to demonstrate the flexibility: you don't have to make that assumption. You can fit whatever kind of response function you might need.
I thank you very much, all of you, for your attention. I know there was a lot here. But I'm very available by email to chat with anybody who wants to use this. So I will be your personal help service, because I would love to see some uptake. So thank you very much.