Decoding cognitive function with MEG: Recent advances, challenges, and future prospects
May 8, 2019
MEG Workshop
Dimitrios Pantazis, MIT - McGovern Institute for Brain Research
DIMITRIOS PANTAZIS: So the next topic is about decoding, using decoding methods to access the information contained in MEG measurements. And of course, typical univariate methods of analysis involve treating each variable-- which can be a sensor or a source-- as an independent piece of data, and then conducting statistical tests separately for these variables. And of course, in the presence of multiple variables, you can have mass univariate analysis, in which case the same model is fit separately to the different variables.
However, this is not necessarily the optimal approach. It's certainly the easiest to implement. However, the point of this talk is to discuss the use of multivariate methods to extract information contained in multiple variables, in distributed patterns of measurements across these variables. And throughout this talk, you may hear me talking about multivariate pattern analysis, multivariate classification, multivariate decoding. All these will be used interchangeably, meaning the same thing, which is applying a classifier to multivariate data.
Multivariate methods, in general, encompass other approaches, other techniques, as well. However, in neuroscience, classifiers are by far the most popular tools to use. And you can see, in sources and in sensors, how distributed patterns can contain information.
Let's assume the case of a single source, or a single sensor. In this case, the point of these bar plots is to show that the corresponding source or sensor may not carry information about the condition-- condition A or condition B-- and shows the same kind of activity for both. So we cannot use these source data to discriminate the different conditions.
However, once you have two different sources or sensors, things may become more interesting. In this case, in the presence of a second source or second sensor, when plotted in a 2D space-- where one axis corresponds to the activation profile of one source, and the other axis to the activation profile of the other source-- you may see that the patterns have become separable.
And this is exactly the strength of multivariate methods. By having multiple variables, we may obtain patterns in high dimensional spaces that separate these conditions. And this is the cornerstone of the great usefulness of decoding techniques.
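To make this concrete, here is a minimal sketch of the idea just described-- not from the talk itself, with all distributions and parameters invented for illustration: two simulated "sources" whose individual activations overlap heavily between conditions, yet whose joint 2D pattern is linearly separable.

```python
# Illustrative sketch (assumed synthetic data, not the speaker's experiment):
# each source alone carries little condition information, but the joint
# 2D pattern is nearly perfectly separable by a linear classifier.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200  # trials per condition

# Condition A: the two sources share anticorrelated noise around 0.
x1_a = rng.normal(0, 1, n)
x2_a = -x1_a + rng.normal(0, 0.1, n)
# Condition B: same structure, but shifted along the x1 + x2 direction.
x1_b = rng.normal(0, 1, n)
x2_b = -x1_b + 1 + rng.normal(0, 0.1, n)

X = np.column_stack([np.concatenate([x1_a, x1_b]),
                     np.concatenate([x2_a, x2_b])])
y = np.array([0] * n + [1] * n)

# A classifier on the second source alone performs modestly...
acc_single = cross_val_score(LinearSVC(max_iter=10000), X[:, [1]], y, cv=5).mean()
# ...but the joint 2D pattern is almost perfectly separable.
acc_joint = cross_val_score(LinearSVC(max_iter=10000), X, y, cv=5).mean()
print(acc_single, acc_joint)
```

The single-source accuracy stays well below the joint accuracy, which is the multivariate advantage in miniature.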
And they have been used very extensively in fMRI to decode all kinds of information, including visual features, visual objects and scenes, top-down attentional processes, working memory, episodic memory, and so on, so forth, all kinds of cognitive function. And increasingly, we see them being used in MEG, certainly here in MIT. They're becoming more and more popular.
And I believe they will be even more popular in the future. Decoding methods have been used for a long time for some very typical applications, such as brain computer interfaces, measuring disease progression-- Alzheimer's disease, for example-- or developing neuroimaging-based lie detectors. However, all these applications give emphasis to prediction. We want high classification accuracy, high decoding accuracy.
But now, we are asking more of these methods. We want to use decoding tools to interpret brain function, to understand brain function. And this introduces several challenges, as I will describe in the following. And the emphasis in my talk will be on using these decoding tools to interpret and understand brain function, not just extract high decoding accuracies.
So there is a lot of information encoded in MEG signals. And when applying these tools, one may be surprised how often we're able to decode different types of information. It's easy to be lost in these multi-dimensional patterns corresponding to many different dimensions-- time, space, even spectral dimensions that are not necessarily shown here. But you can see the complexity in these data.
However, applying these decoding tools, we can extract information. MEG and EEG have been used to characterize the processing of simple visual features, such as the position and orientation of contrast edges, which I will actually describe later on. Complex visual features, such as the representation of objects and scenes-- a lot of work here at MIT from different groups, and of course elsewhere in the world.
Auditory representations, the maintenance of information in working memory, visual motion, and of course, developments in methods-- these are some of the articles. And every year, I see more and more published, which is very encouraging. And that reflects the usefulness of these methods.
In terms of the conceptual framework, I think it is quite straightforward. And it is explained here, in this slide. I show here, overlaid on the same axes, 306 time series, because our instrument has 306 sensors measuring magnetic fields. Shortly after the onset of a visual stimulus, indicated by the dashed line, we see deflections of the magnetic fields at every millisecond.
Let's say, for example, here at 100 milliseconds, we capture the topography of magnetic fields emitted from the human head. And this constitutes a pattern, a time-resolved pattern. So now, assume a hypothetical experiment where we present faces in multiple trials.
In every experiment, we present the same stimuli several times to increase the signal to noise ratio, because individual trial signals are contaminated by ongoing brain activity and other artifacts. So we need to repeat these stimuli multiple times to obtain reliable evoked responses. So in one trial, let's assume that we obtain this pattern at this reference time instance, 100 milliseconds.
That is an example pattern, and another example pattern for a different trial, and another example for a different trial. There are some similarities. And that's not a coincidence. I designed it to look visually pleasing to the human eye.
It's not reflecting reality. But still, it will make the case. And now, assume that we are displaying, let's say, body parts-- a hand-- in this condition B. And then, we obtain different patterns in different trials. And again, there is some sort of meaningful signal there.
And I will present this as a training set, for you humans to recognize these two conditions. And now, I will use those training sets and hold out other trials. And I will ask you, what do you think this pattern represents, where does it originate from?
And you don't need to answer. It's not a difficult question. You will all say, hand. Thank you very much, and the same here.
You will say a face. And I will say you are correct, because you are excellent as humans. And I would say you have 100% accuracy. And that's really what is happening.
Only we don't do it with humans, but we do it with algorithms, because it would be too tedious with humans. And they wouldn't really understand the data.
So that's what decoding approaches in MEG are. And basically, the most useful aspect of these is time-resolved decoding-- because I just described how we can do decoding for a single time instance. And with algorithms, what we do is, basically, represent these patterns as pattern vectors. So we concatenate the sensors into a feature vector, 306 measurements again.
And then, we have our conditions. We present the visual stimuli. We separate the trials-- the patterns-- in training sets and testing sets. And then, we apply a classifier instead of humans, support vector machines in this case, to extract the classification accuracy.
And we do this several times with random assignments of patterns to the training and testing sets, so that we obtain an average performance of the classifier. And we report that classification accuracy for this corresponding time instance. And of course, by repeating the entire procedure for every time instance, for every time t0, we obtain a decoding time series.
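The procedure just described can be sketched in a few lines. This is a minimal, hypothetical implementation on synthetic data-- the sensor count, trial counts, signal pattern, and onset time below are all illustrative assumptions, not the speaker's actual pipeline:

```python
# Time-resolved decoding sketch: for every time point, cross-validate a
# classifier on the sensor pattern and keep the mean accuracy, yielding a
# decoding time series. (Synthetic data; all parameters are illustrative.)
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_trials, n_sensors, n_times = 60, 20, 50
onset = 25  # "stimulus onset": condition patterns differ only after this sample

X = rng.normal(0, 1, (2 * n_trials, n_sensors, n_times))
pattern = rng.normal(0, 1, n_sensors)
# Condition B (second half of trials) carries an extra sensor pattern post-onset.
X[n_trials:, :, onset:] += pattern[:, None]
y = np.array([0] * n_trials + [1] * n_trials)

# One cross-validated accuracy per time point -> a decoding time series.
decoding = np.array([
    cross_val_score(SVC(kernel="linear"), X[:, :, t], y, cv=5).mean()
    for t in range(n_times)
])
print(decoding[:onset].mean(), decoding[onset:].mean())
```

Before the simulated onset the accuracy hovers around the 50% chance level; after it, the classifier picks up the injected pattern.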
And these are very useful results, because they allow us to localize in time the information contained in MEG measurements. And assuming that we have several stimuli instead of just these two, one can exhaustively apply this approach for all pairs of stimuli. Let's say we have multiple faces, objects, and other types of images.
So for every pair, we can populate this decoding matrix. So the element here would be the decoding accuracy, corresponding to this pair of stimuli. And eventually, we get a pattern.
We also call this a representational dissimilarity matrix, or decoding matrix. And it shows how the brain represents different stimuli relative to one another. And that's very useful content to have, as you will see very soon.
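A hypothetical sketch of populating such a decoding matrix-- one pairwise classification accuracy per pair of conditions. The condition count, patterns, and shapes are invented for illustration:

```python
# Build a decoding (representational dissimilarity) matrix from pairwise
# classification accuracies on synthetic data.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_cond, n_trials, n_sensors = 4, 40, 20
patterns = rng.normal(0, 1, (n_cond, n_sensors))  # one pattern per condition
data = patterns[:, None, :] + rng.normal(0, 1, (n_cond, n_trials, n_sensors))

rdm = np.full((n_cond, n_cond), np.nan)  # diagonal left undefined
for i in range(n_cond):
    for j in range(i):
        X = np.vstack([data[i], data[j]])
        y = np.array([0] * n_trials + [1] * n_trials)
        acc = cross_val_score(SVC(kernel="linear"), X, y, cv=5).mean()
        rdm[i, j] = rdm[j, i] = acc  # the matrix is symmetric
print(rdm)
```

Repeating this at every time point would give the time-resolved matrices described in the talk.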
And as an example, to demonstrate this, I will use the decoding of the orientation of contrast edges, which is a very basic visual feature. The stimulus set in this experiment was 6 grating patterns of different orientations, in steps of 30 degrees. And by now, you should be familiar with what is displayed here: the pattern classification for different stimuli-- for example, 90 versus 30 degree orientations.
But we can do it for all pairs to populate this 6 by 6 matrix. And we show here the lower triangle, because these matrices are symmetric. The diagonal is also undefined. And every element is essentially a time series, because we can repeat this process over time. So we get these matrices over time.
And the question one may ask is whether we can really decode orientations from MEG data. And the answer is yes. And we can see how, shortly after the presentation of the stimulus, the decoding time series go high up, close to 80%, 90% decoding accuracy.
And then, they go lower. And then, there is the offset of the response at the second vertical line. And then, they drop again.
We also answer a different question here: whether the human brain represents cardinal versus oblique stimuli in a different way. And it does. We see the cardinal pairs-- highlighted with a darker color here-- and cardinal pairs mean one of the grating orientations is horizontal or vertical, 0 or 90 degrees.
And in this case, the brain represents them more strongly, because we can decode them better than the other conditions. And we can do this for evoked responses. One can also apply the same procedure for different types of measurements-- in particular, applying time frequency decompositions to extract the spectral signals and focusing on gamma rhythms. Because gamma rhythms are very strong signals, and they are induced by these types of stimuli.
So the question would be whether gamma rhythms actually represent some form of orientation information. And the answer is yes. And we can see here, for gamma rhythms, when constructing pattern vectors like these-- but representing gamma power instead of evoked amplitudes-- we can again decode using these feature vectors.
And we see representations that are actually quite stable. The decoding patterns are very stable over time, for the duration of the stimulus. So using analyses like this-- without even having seen the data; in fact, I haven't even shown you the MEG data-- we already know a lot about the MEG data: how the information localizes in time, and what kind of information is contained, in terms of cardinal versus oblique orientations in this case.
And even more, this analysis I think is really fruitful, because now we can take these matrices and compare them against hypothesized models. In this case, we have a cardinal model and an angle disparity model. The cardinal model presumes that cardinal orientations are more strongly represented than oblique orientations.
The angle disparity model assumes that decoding is related to the difference in angle between pairs of stimuli. And we can take these models and correlate them, compare them with the representational dissimilarity matrix, or the decoding matrix, of the MEG data, time-resolved. So for every time instance, we can do this comparison.
And the comparison is a simple correlation, Spearman's rho-- a rank-based correlation. And we obtain this correlation time series. And let me draw your attention, for example, to this plot, which shows the gamma rhythms and how the responses in gamma correlate with the different models.
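The model comparison at a single time instance can be sketched as follows, for the 6-orientation stimulus set described earlier. The "data" RDM here is simulated (dominated by a cardinal effect plus noise), and the exact model definitions are my own plausible reconstructions, not the speaker's code:

```python
# Compare a data RDM against hypothesized model RDMs with Spearman's rho.
import numpy as np
from scipy.stats import spearmanr

orientations = np.arange(0, 180, 30)  # 0, 30, 60, 90, 120, 150 degrees
n = len(orientations)
iu = np.triu_indices(n, k=1)  # off-diagonal entries only

# Cardinal model: a pair counts as "cardinal" if it contains a 0- or 90-deg grating.
cardinal = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if orientations[i] in (0, 90) or orientations[j] in (0, 90):
            cardinal[i, j] = 1.0

# Angle disparity model: dissimilarity grows with the angular difference.
diff = np.abs(orientations[:, None] - orientations[None, :])
disparity = np.minimum(diff, 180 - diff)

# Simulated "data" RDM dominated by the cardinal effect plus noise.
rng = np.random.default_rng(0)
data = cardinal + rng.normal(0, 0.2, (n, n))

rho_cardinal, _ = spearmanr(data[iu], cardinal[iu])
rho_disparity, _ = spearmanr(data[iu], disparity[iu])
print(rho_cardinal, rho_disparity)
```

Repeating this for the RDM at every time point yields the correlation time series shown on the slide.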
And the angle disparity model is hovering around zero. So it doesn't really carry any information about the MEG data. However, the cardinal model really explains the data for the duration of the presentation of the stimuli.
And not only that, but it's very close to the noise ceiling. I will not explain exactly what this noise ceiling is, just to say that it represents the highest correlation that one can get, given the inherent noise of our observations. Because we make MEG observations in different subjects, and they are subject to random noise, this limits the highest correlation that one can achieve.
In this case, the highest correlation one can achieve is more or less the same as the cardinal model. Thus, in this case, we almost fully explain what is contained in the gamma rhythms. And what is contained is representation of cardinal versus oblique orientations. So this is very powerful.
And last, I want to describe another approach, which is temporal generalization of decoding. I did mention training and testing-- using different data to train and test the classifier and obtain decoding accuracy. We can play the same game.
But we train for a given time instance and test at different time instances. And the idea here is to study if brain representations are transient or sustained. If they are sustained, then training the classifier in one time instance and testing it in another time instance will still work. And we will still obtain high decoding accuracies.
So that's what we're testing here, for both evoked and gamma responses. We see here, for gamma responses, that the response is nearly square across training and testing times. So no matter when we train and test the classifier, we get almost the same decoding accuracy, meaning that whatever information is contained in the measurements is really the same throughout this time.
So this is really a textbook sustained response in the MEG data, as opposed to the evoked responses, which are much more transient, certainly in the very initial part. It's really much more transient. So we also get an idea about the temporal dynamics and the temporal evolution of these representations.
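A minimal sketch of how a temporal generalization matrix is computed-- train at one time point, test at every other. The synthetic data below carry a sustained pattern, so the resulting matrix is the "square" of high accuracy described above; all shapes and the signal structure are illustrative assumptions:

```python
# Temporal generalization sketch: train at t_train, test at every t_test.
# A sustained representation yields high accuracy off the diagonal too.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_trials, n_sensors, n_times = 40, 20, 30
pattern = rng.normal(0, 1, n_sensors)

X = rng.normal(0, 1, (2 * n_trials, n_sensors, n_times))
X[n_trials:] += pattern[:, None]  # condition B: the SAME pattern at every time
y = np.array([0] * n_trials + [1] * n_trials)

train = np.arange(0, 2 * n_trials, 2)  # even trials for training
test = np.arange(1, 2 * n_trials, 2)   # odd trials for testing

gen = np.zeros((n_times, n_times))
for t_train in range(n_times):
    clf = SVC(kernel="linear").fit(X[train, :, t_train], y[train])
    for t_test in range(n_times):
        gen[t_train, t_test] = clf.score(X[test, :, t_test], y[test])
print(gen.mean())
```

A transient signal, by contrast, would concentrate the high accuracies along the diagonal of this matrix.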
And of course, there are more examples, such as the work from Leyla Isik back in 2013, demonstrating that we can decode objects when their position varies-- thus, the emergence of position invariance in the visual stream. Also, more recent work from Katharina Dobs, showing how we can decode different properties of faces, and some other seminal work from Radoslaw Cichy, showing how we can decode different aspects of stimuli, such as animacy, naturalness, and others.
And all these works originated from MIT, showing the different types of studies we can do on MEG data using decoding. And of course, there are a lot of different studies, as I mentioned, from here and elsewhere. All these bring, though, some conceptual issues that I want to discuss for the rest of my time.
One is, why do we want to use decoding? And the answer for that is actually simple, because these methods-- these multivariate pattern classification methods are powerful. They are robust.
So we want to use them instead of univariate tools. They also allow us to extract information in a straightforward way with the power of algorithms, instead of being lost in the complicated multidimensional patterns. And a more critical question is, what are we decoding?
And when we're decoding, the information we are capturing does not necessarily mean that it's actually used by the brain. And we should be careful with that. And what do I mean by this?
I think I can explain this easily with two examples, one of them being the 2006 Pittsburgh Brain Competition, which was an fMRI competition trying to decode events in videos. And the winning team admittedly did not have much experience in brain imaging. But still, they designed the best classifiers and decoded the conditions best.
However, their weight maps were something like this, with the most useful information coming from the ventricles. And for us as scientists, it's not hard to realize that this is not really understanding the brain, because we wouldn't expect any kind of meaningful activity in the ventricles.
Rather, what happened in this case is that when participants experienced humorous events, they were laughing and moving, producing motion artifacts in the ventricles-- and so we could decode. Thus, success. But that was not really a success for understanding the brain, but rather for winning the competition.
Another very obvious example, I think, is looking at the patterns in early visual cortex. Early visual cortex represents more pixel-like information. And we all know that if we use a convolutional neural network, it can decode almost everything that humans understand. But it doesn't mean that early visual cortex is doing this kind of processing.
So if we plug a convolutional neural network into voxels in V1, we can potentially decode any kind of high level information. But we all know that high level information emerges later in the ventral stream. So whatever we decode from our convolutional neural network is really reflecting the complex computations performed by the network itself, rather than by the human brain.
So again, we should be skeptical when we decode. We may be decoding things that are visible to these very aggressive classification algorithms, rather than what is really happening in the brain. Another conceptual issue is the selection of the classifier.
Which classifier do you want to use? In this case, there are nonlinear classifiers with decision boundaries that can be highly nonlinear-- though this one is not that much. And you can imagine any kind of nonlinear patterns that may not be realistic for what the brain is doing. But still, they may be powerful, because these classifiers will give you very high decoding accuracies.
Thus, these kinds of nonlinear classifiers would be desirable for real world applications such as, for example, detecting brain diseases like Alzheimer's disease. And if I am to do a little bit of self advertising, an example of such a classifier would be a new architecture that we are developing-- I'm developing with colleagues-- called a graph convolutional network, to which we provide brain networks. This also relates to the upcoming talk on how we can construct connectivity maps.
In this case, connectivity was based on an information measure. We have brain connectivity maps, feed them into these complex networks, and extract classification accuracies which, in this case, reach up to 90%-- much higher than previously achieved-- to classify mild cognitive impairment versus control individuals. And that's great, as long as the aim is to obtain high classification accuracy. And in this application it is, because we want to detect patients very early, decades before clinical symptoms.
However, if we are to understand the brain, we don't want to do that. Instead, we want to be using linear classifiers, which restrict the solutions to linear boundaries. And why do we want to do that?
That's because there is a consensus in the community that a linear classifier can capture information that is explicitly represented in the brain. And by explicit, I mean it is amenable to a biologically plausible readout in a single step. Or in other words, a single neuron, when it receives this pattern, can really apply this kind of classification procedure and make this information available to the brain. And thus, we want linear classifiers when our aim is to describe, to characterize, to interpret brain function.
Another conceptual issue is interpreting decoding accuracies. It's desirable to obtain high decoding accuracies, but that's not always possible-- and lower accuracies don't mean the results are not meaningful. When we want to compare decoding accuracies between different conditions, we actually want to be very careful to keep all decoding parameters constant, because classification performance depends on many underlying factors.
It depends on the selection of the classifier, on the cross-validation scheme, on how well the two classes are separated from one another, on how many data samples we have, on how many and which variables we use, and on the underlying noise in the measurements-- whether we do noise whitening. All these factors will really influence our results. So if we are to say we can extract more information in one condition than in another, we need to be very careful to control for all of these.
Another critical point in interpreting decoding accuracies is that we should see these as information-based measures, rather than activation-based. In this case, let's assume, hypothetically, that we have a brain area that responds strongly to objects, scenes, and gratings, but very weakly to faces. This area is activated by everything other than faces.
But at the same time, it is maximally informative about the presence of faces. So when we decode, we will really be able to decode the presence of faces there. But it doesn't mean that we have activation for faces. It just means we have the information.
And then, going back to activation measures is non-trivial. We need to be very careful in interpreting these, and then look at the underlying patterns. In a similar way, we cannot make directional inferences such as activity in A is greater than in B.
Rather, we can only say that the activity is different-- A has different activity than B, because we can decode, so there is information. And this may become tricky again when we are to compare activity across different populations, and so on. So I think it is good, when we decode, to isolate information.
But then, we should look at the underlying activity patterns to understand what is really happening in the brain. Also, it's tricky to interpret decoding weights. A common practice is to-- and I will jump a little bit; I re-organized these slides, but unfortunately, I'm using an old version. It's OK.
So it's often common practice to see these weight maps and then say, OK, these different sensors have different weights-- and thus you can conclude which sensors are involved in decoding, which contain information. And this, to some extent, is true.
But all the sensors together are contributing to these weights. And it doesn't mean that individual sensors in isolation are really contributing to decoding accuracy. The way to understand this is, for example, to say that we have two sources.
The first source has signal plus noise. The second source has only noise. And this noise is very highly correlated with the previous noise.
The way to recover the signal is to take the difference between the two sources, in this trivial case. So the optimal weights of a potential classifier would be 1 and minus 1. Though the weights are equally strong, it doesn't mean that the second source carried any signal.
It actually has only noise-- but noise that is useful to remove the noise from the other measurements and, eventually, obtain high decoding accuracy. So we should be very skeptical interpreting these weight patterns. Instead, there was a brilliant paper from Haufe et al. in 2013, describing a method to convert these weight maps into activation patterns.
And these can be interpreted as underlying sources. So this is highly recommended to do when we interpret results. Not only in sensors, but even in sources actually-- if we were to construct weight maps in cortical sources, we would still want to do this transformation.
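A sketch of this weight-to-activation-pattern transformation (in the spirit of the Haufe et al. approach, where the pattern is proportional to the data covariance times the weights), applied to the two-source example above-- source 1 carries signal plus noise, source 2 only the correlated noise. The data are simulated for illustration:

```python
# Weight -> activation-pattern transformation for a linear model: a ~ cov(X) @ w.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
signal = rng.choice([-1.0, 1.0], n)  # the condition signal
noise = rng.normal(0, 1, n)          # noise shared by both sources

x1 = signal + noise  # source 1: signal plus the shared noise
x2 = noise           # source 2: only the shared noise
X = np.column_stack([x1, x2])

w = np.array([1.0, -1.0])  # optimal extraction weights: x1 - x2 recovers the signal

# Weight map: both sources look "equally important" (|1| and |-1|)...
# Activation pattern: only source 1 actually carries signal.
a = np.cov(X, rowvar=False) @ w
print(a)  # approximately [1, 0]
```

Analytically, cov(X) is [[2, 1], [1, 1]] here, so a = [1, 0]: the pattern correctly attributes the signal to source 1 alone, while the raw weights do not.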
Note, this transformation is only applicable for linear classifiers. But I already suggested that it's a good idea to use linear classifiers, so that's not an inherent issue. In fact, results from multiple groups show that for this kind of data-- MEG data and other functional data-- we don't gain a lot by using nonlinear classifiers for this kind of problem.
And to conclude soon, another question is, what goes into the classifier? We can decode individual stimuli, such as image A versus image B. We train and test across trials: we keep some trials for training, some trials for testing.
And this is very powerful, because we obtain decoding time series, but also this representational dissimilarity matrix, which we can study for the different information contained in the stimuli. And I can skip that. There are different ways to do decoding, such as condition decoding. In this case, let's say we have the contrast of human faces versus human bodies.
So we have different stimuli for faces and bodies; then for the testing set, again, we use the same stimuli but different trials. A more flexible way to do this is to keep some stimuli for training and some stimuli for testing, doing a cross-exemplar validation. This is more powerful, because we show that whatever the classifier learns generalizes across different stimuli.
Even more generally, we can do cross-condition decoding. Let's say, for example, we again train on human faces and human bodies. And now, we can test on animal faces and animal bodies. So if we manage to do this kind of training and testing, then we show that this face and body information really transcends humans and covers different species.
And finally, as I mentioned as well, we can do training at one time and testing at another time, for a cross-time cross-validation scheme. And this one distinguishes sustained versus transient representations. This has been applied in several works, in different aspects, getting these temporal generalization maps.
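A hypothetical sketch of the cross-condition scheme-- train a face-versus-body classifier on "human" trials and test it on "animal" trials. The shared-component structure below is purely an illustrative assumption about how such generalization could arise:

```python
# Cross-condition decoding sketch: if accuracy transfers from human to animal
# trials, the face/body information generalizes across species. (Synthetic data.)
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_trials, n_sensors = 50, 20
face_vs_body = rng.normal(0, 1, n_sensors)   # component shared across species
species_human = rng.normal(0, 1, n_sensors)  # species-specific components
species_animal = rng.normal(0, 1, n_sensors)

def make(category_sign, species):
    # Trials = shared face/body component + species component + noise.
    base = category_sign * face_vs_body + species
    return base + rng.normal(0, 1, (n_trials, n_sensors))

X_train = np.vstack([make(+1, species_human), make(-1, species_human)])
X_test = np.vstack([make(+1, species_animal), make(-1, species_animal)])
y = np.array([0] * n_trials + [1] * n_trials)

clf = SVC(kernel="linear").fit(X_train, y)
acc = clf.score(X_test, y)
print(acc)
```

Because the face/body component is shared across species in this simulation, the classifier trained on human trials still performs well above the 50% chance level on animal trials.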
And I will conclude here. So we need to be careful of these issues when we apply these decoding techniques. But otherwise, they are really powerful tools to interpret brain function.