23 - Group Analysis - Hands-On: Part 2 of 2
January 28, 2019
May 31, 2018
Gang Chen, NIMH
All Captioned Videos AFNI Training Bootcamp
Gang Chen, NIMH
For more information and course materials, please visit the workshop website:
We recommend viewing the videos at 1920 x 1080 (Full HD) resolution for the best experience. A lot of code and type is displayed that may be hard to read otherwise.
GANG CHEN: So the next example I'm going to show you is the Bayesian approach I talked about yesterday. You don't have the data available on your computer, but so you have to watch. So let's briefly review what we talked about, the example we showed yesterday using ROI-based approach to do group analysis.
So remember this case, we have 124 subjects. It's resting state data. So for each subject, we performed theta-based correlation. So we have a theta region, and we calculate the correlation between the set and the rest of the brain. So that's for each subject. So then we have the z-score for each subject, three-dimensional data set.
So at the group level, the analysis, the model is pretty simple. We just have one exploratory variable, which is the theory of mind index from each subject. So the model is, if we just do the whole brain analysis, each voxel would be y equals at a plus bx plus the residual, right?
So when we-- if we stick to the idea of the traditional approach, usually we can only manage to have two clusters to survive the-- why were the cluster-based multiple testing correction. So that's little bit-- I mean, at most four clusters, depending on what the voxel wise p-value we want is set as a primary threshold. So two to four clusters, that's we can manage to get.
So how can we do it better? So instead of do whole brain analysis, we focus on 21 ROIs. So the model, here we are doing is that we do partial pruning. That means we put the ROI as a variable, and also we assume the ROIs, those intercepts, and the slopes, are not-- they are not independent of each other. Instead, they are bound by a loose assumption, which is the Gaussian distribution, among those 21 ROIs So that's the prior information we apply to those 21 ROIs.
So that's the pruning part, or some people call the shrinkage effect. So that's the crucial part, the pivotal part that we apply to this 21 ROI-based approach. So now let's look at the exactly how we do it. You don't have the data, so you'll basically watch me how to create a script to run.
So you just-- in this case, you just need to create a table as pretty much like the 3dMVM or 3dLME table. Data table, the first column is the subject labels. So in this case, the 21 subjects.
Each subject-- remember, there are 21 ROIs. So 124 subjects, each subject you have 21 ROIs. So we would have-- in this table, we would have 124 times 21. That's the total number of rows. So we need to create that. So that's-- I mean, there are many ways to do it. You can use spreadsheet to do that.
So those are the first two columns, are the subject labels and the ROI labels. The third column is the y, is the response variable, is the z-score. Remember, this is the theta-based correlation value. It's converted to z-score.
The first column is the theory of mind index. That's the covariate. There are three extra y's. We're not going to use it. We just use the fourth column, is the total. That's the total index for theory of mind measurement. So the other three, we just ignore in this case. But we can leave it there.
So that's the data structure we have. So we're going to focus on the first four columns. Then let me show the script. Is it here? So that's the script.
So the program name is already available. It's even on your computer, you can see the help, actually. It's Bayesian-- let me-- you can type it, too. BayesianGroupAna.py-- dot P-Y. You can check it then type help. You can pipe it-- I mean, as I show here. I mean, the default is always dash help. You don't have to type that part.
If you type that, you're going to say to help, the typical AFNI program help. So show what options you have. In the end, probably there-- I mean, for example, how to create a data table, what options you have. Then in the end, probably hopefully there is some example script.
Yeah, it has a have one here. See this? That's an example of a script. So you can use that as a template to create your own script. So this is a script that we are going to try here. So in this script, first we have the program name. It's called BayesianGroupAna.py.
Then option data table. Then they put the file name after that table we just saw. That's the first line.
Second line is the y. y is the response variable. It's the z-score, right? That's the variable name, the column name in the data table. The x, there's only one. In this case, we consider the total theory of mind index. So that's the fourth column in the data table.
So the next few lines. So the next one is the prefix. Prefix is just the output file name. Then there are those, the next three options are a little bit esoteric, right? So most people probably not familiar with. Remember, this is we're doing Bayesian modeling.
Doing Bayesian modeling, it's to some extent it's dramatically different from the conventional statistics in many ways. So there is conceptually some differences. And also, there is-- from the modeling part. The modeling part is-- usually, mostly we cannot analytically solve it, unlike, for example, regression. So we just use the ordinary least squares. You just calculate it.
But for Bayesian, there's no-- unless there is a special, special, special case, you can still get the analytical solution, otherwise you have the wrong multiple chains, the concept of multiple chains, the Monte Carlo simulations, Markov Chain Monte Carlo simulations-- MCMC.
So we need the multiple chains. I mean, I don't want to get into down to the details, but each chain will go through the Markov Chain Monte Carlo simulations. Then in the end, we just pull the data then plot it out-- I mean, get the posterior distribution.
There is no such a thing called the p-value in Bayesian land, because whenever people talk of p-value, that's always associated with the concept of null hypothesis. You pretend nothing is going on. Then you take that, you do witch hunting.
So in Bayesian land, there is no p-value in the end. Well, we can talk about probability, but that probability is not a conditional probability. You don't need to pretend anything. You just have-- already have the distribution about your parameter. You can say something, the p-value would be-- I mean, you can say, what the-- main effect is greater than 0.5, what's the probability? You can talk about that.
So that's called a posterior distribution. So why it's called a posterior? Because we have the prior information. In this case, the prior information is to assume those 21 ROIs, they follow some sort of a Gaussian distribution. So that's our assumption.
So that's why in this case, we'll need four chains-- at least four-- then some number of iterations. In this case, with each chain we go through 1,000 iterations. So the last line is the R data. It means once we run the Markov Chain Monte Carlo simulations, we save the result.
I mean, we don't have to. But in case we want to do something more-- I mean, once we have the data, then we can do, for example, model checking-- so model validation and cross-validation, for example, so on and so forth. We can do many things, actually. But you don't have to save that unless you are interested.
So that's basically the simple layout of this script. So in the script, you just type tcsh run then you type that. On your computer you cannot do it, because I already have the output file name there. So that's why it's not letting me remove that file to avoid that conflict.
Let me call it 2. So now then go back to run it. So now it basically rates the data table, giving us some summary about the data information. The notes up there are packages then read in the table.
Now notice at the bottom, it says now it's compiling to C++ code. So it's reading-- the interface is written in R, but the core part of this Bayesian modeling approach is in C++. It's called the Stan-- S-T-A-N. The specific program is called the Stan-- S-T-A-N.
That's basically, it's Stanilov. I think he was one of the team members who did the hydrogen bomb in 1940-- '40-something. '47, probably around that period of time after World War II. So he was one of the team members.
Basically, he did a significant amount of Monte Carlo simulations. That's why this C++ package is-- I mean, it's named after him. So now it's complaining to the C++ code.
Once the compiling is finished, this is going to go through those 1,000 iterations. Four chains-- unfortunately, on my laptop I only have two CPUs. So I cannot fully parallelize the process. So ideally, it will be on some laptop-- I mean, desktop, at least four CPUs.
So once it's finishing compiling to the C++ code, now it's running those chains. So now it's doing warm-up, then later on data sampling stuff. So that's if you're familiar with the Bayesian computation, then that's pretty much the terminologies for people to describe.
So they give us some rough runtime. But that rough runtime is not accurate. So in this case, usually it would take-- on the desktop it will take five minutes. On my laptop, it worked on 10 minutes, because I only have two CPUs. That's why, so.
I'm not going to do it, because I already have the results. So let me kill it. So since I already have the result-- so once you run it, you get a bunch of output. But the main output is this table.
That table-- basically, remember, we're doing ROI-based analysis. In this case, we're interested in the slope at each ROI. The slope effect, basically, it's the-- the behavior data is the theory of mind effect. So that's called the total. The variable name is called total.
So each ROI will have one row. That row shows the mean, the standard deviation, and the quantile intervals. It's like used similar to the conventional study. So you've got a confidence interval.
But here in the Bayesian world, it's because not just the terminology is different. And also, what it means is also slightly different. But it's similar. So it's called quantile interval. The quantile interval you can convert through some probability.
So that table, those rows, it's pretty much like I showed yesterday. It's pretty much this table. So you can directly create a table when you publish the result. So the nice thing about this is-- I keep emphasizing that-- I don't do cherrypicking. I don't [INAUDIBLE] write the paper.
I'd only say whatever I focus on, for example, eight ROIs. That's what I would command it to do. And we command it to represent everything there.
So with that output, then you can focus on a few ROIs you feel you are more confident to say I have strong evidence to show they are theory of mind effect for those ROIs. But that doesn't mean you need to hide the whole result, the other ROIs.
So that's crucial for general, for science reporting, results reporting. Because the reason is that, even some of those-- most of ROIs don't have strong evidence. Using this traditional studies, you would see they are statistically insignificant. But that does not necessarily mean there is no activation among those ROIs. Simply just because the evidence is weak doesn't mean there's nothing going on there.
So when we report these results, later on, if the knowledge keep accumulating, people can still use this whole data set to do meta analysis. So next paper, next study-- and even though for this, our current study, one particular ROI is not-- the evidence is weak, but another study may be stronger. People can still use this data as part of the meta analysis. They can compare different studies.
So that's-- we keep emphasizing-- I mean, emphasize the importance of reporting the whole results. So this is the example demonstrating the Bayesian approach. So this is pretty simple one. So there you could have multiple experimental variables. Remember, there are other few indexes, too. You can put in a more sophisticated model.
So the program is available. It's in the distribution. So your computer, it should have it. So in case you want to try out the approach, you can-- basically, you just create that table, then write the script, then let it run.
So it depends on the number ROIs. It may take longer. I mean, the longest so far I have tried-- not for this kind of analysis. I tried, for example, the graph theory. You have the correlation matrix from each subject. I also tried the intersubject correlation.
So we have probably close to 300 ROIs. That's the part of the most sophisticated scenario-- I mean, amount of data. So the longest I have tried is three weeks. So that's-- it is, I mean, burdening-- I mean, burdensome, in terms of computational cost.
But the results, I mean, are really nice. I mean, that's the rewarding part. It shows more detection power than the alternative traditional approach. So that's the Bayesian approach.
Now let's move to the next one, the case three. So in this case, it's a multiway ANOVA. Let me say. Oh, we have-- it's a four-way ANOVA. So first, we have one between subject variable is a group-- three groups-- I mean, two groups. The young group, 15 subjects. The older group, I have 14. So let's go to that direction first, actually.
Let's go here. Go back one directory up then the third one. So in that directory, I already created the 3dMVM script. So let's run it, actually, before we talk about the detail. So tcsh MVM-- So on your laptop, you can try to see if it works.
So on the interface, I mean, once you type-- once you start running the script, so you can see the interface gives you multiple tables. That may sometimes help you to see if there is any problem with that. So for example, the data structure gives you some of the contingency tables.
So once it loads up, then keep running. So it's done. Well, it's quick. That's because this is not a whole brain. I believe it's just probably 3 by 3 by 3 voxels.
But let's go back to the slide here. So remember, this is a four-way ANOVA. There are four factors. The first factor is between subject factor, there are two groups.
Then there are three within-subject factors. Some people call it repeated measures factors. So there is a task, there are syllable, there is a sequence. So this is a linguistic study.
So there are three repeated measures factors. Each of those three factors has two levels. So for example, task factor, there is a perception versus production. Those are the two levels. syllable, it be simple syllable or complex syllable. Sequence, also there is a simple and complex.
So this is a four-way ANOVA, right? So it's a little bit challenging. But once we conceptualize that as a ANOVA, so it's not that hard. So in 3dMVM, let's show the script.
So at the top, within that script it describes the data structure. At the bottom again, it's that data table, the typical 3dMVM data table. So the first column, you list all the subject labels. The last column is the input files.
You may notice the input files are a little bit strange. Look at the file names. So they are not three-dimensional data set. It's a surface data, in [INAUDIBLE] format, which either Rick or Daniel will talk about this afternoon about the SUMA.
So this is a surface data. So that's why they're not nodes. But anyway, so in AFNI, pretty much all programs take both three-dimensional data set, or it can be in AFNI format or NIfTI format-- dot N-I-I. So that's the NIfTI format, which is a shared by all software packages.
It can also take surface data. So in this case, it's surface data. It can also take one-dimensional-- I mean, column datas like that's called 1D. So it's just pure text with numbers in columns. So AFNI can-- programs can take those different formats as input, so in this case, the surface data.
So between the first column and the last column are those experimental variables. In this case, remember, it's a four-way ANOVA. So that's why we have four columns in between. So we will have a group-- it's a young versus old; then there is a condition-- house versus face; then there's the image-- simple versus complex; clarity, also simple versus complex. So those are the four factors.
It worked. So I don't know if this probably will-- I don't know if it works, the surface data. Let me try it. I don't have any data, so it doesn't work. Because it's surface, I probably have to do the-- use the SUMA to load it up. But this, since we haven't talked about the SUMA yet, so let's forget about it.
All right. Any questions about this script, this 3dMVM script? Oh, actually i forgot about-- talk about the post-hoc test here.
So I mean, the program-- let's just talk about the script, because it always helps to refresh your memory about the model specification. So in this case, 3dMVM and prefix-- that's, as you [AUDIO OUT] the output file name. Jobs, we using tool processors.
3dMVM, we have two options to specify the variable types. So the BS is the Between Subject variables-- only have one, which is the group young versus old.
Then the next line, [INAUDIBLE] vars. That's within-subject variables. We have three-- three repeated measures factors. So we have a condition-- that's house versus face; image-- simple versus complex; clarity-- simple versus complex. So that's how we specify those three within subject [AUDIO OUT]
You may notice, I put a star between those factors. So that means we want all the main effects plus the all possible interactions among those variables. OK. So star means-- a star b means a plus b plus a colon b. Plus means we want the interaction, colon means interaction.
So the next lines-- nine lines are the post-hoc tests. So then we just use a coding with the label. So use the-- for example, the first [AUDIO OUT] we compare those two groups [INAUDIBLE] to find the weights.
Unlike the other software packages that you've-- when they do a sort of-- they do this kind of a general linear model, you have to put your dummy code in 0's or 1's or minus 1 in the columns. Here, we don't need to do that. The problem automatic creates the design matrix. For us, we only need to use this coding, use the weights, use the factor levels to specify those tests we want.
The fourth one, it's relatively simple, but there is one complexity. So there's two variables. One is that within-subject [AUDIO OUT] condition-- the house on the face. That's one factor. It's a repeated measure factor.
Then there's a quantitative covariate. That covariate is reaction time. In this case, you may call the ANCOVA, but the problem is the covariate, the reaction times, it's within-subject covariate. What do I mean, the covariate is within subject or repeated measure?
Remember, we have two conditions-- house and the face. So that means each condition has a reaction value. So that means each subject has two reaction time numbers.
Unlike between-subject covariate, you only have each subject has one. Like age, you have each subject is the number of years. So that's just one number.
But here, the reaction time, because we only have two conditions, that's why we have two reaction time values attached to each subject. So that's the complexity. That's also the challenge in how to handle it.
So that's why the traditional ANCOVA would not work. With traditional ANCOVA, you can only have between-subject covariate. So that's why we use the 3dLME using linear mixed effect modeling approach.
All right. Let's go to that directory, one way up, go to the number four-- random slope. So let's [AUDIO OUT] let's read this script-- I mean, discuss the script. So again, at the bottom is the data table. So that's-- at this point, I mean, you should feel comfortable about understanding the table.
So the first column is the subject labels. The last column is the input files. In between are those experimental variables. So in this case, we have two. We have the within-subject factor; the condition-- house versus face; the reaction time is just numbers.
I already centered them, because I want to center them within each condition. I already centered them. So that's why you see those numbers, some positives and negatives. Of course, the real reaction time is always positive. You don't-- well, I mean, presumably, you should not [AUDIO OUT] values. Because I centered, that's why you see positive and negative values.
Now let's look at the model specification. So the first line here, it's the 3dLME prefix. So I want to output file name-- output file name, then I will have [INAUDIBLE] use on my laptop. So the next line, that's the model specification.
So in the case I want condition by reaction time. So condition star RT. That means we want both main effects under the interaction. So what does it mean? What does it mean by interaction between the factor and the covariate?
That means we want to see the different reaction time effect between the two conditions. So that's the concept of interaction in [INAUDIBLE] So that's the model.
So in the output, we should have at least three F-tests-- two for the two interactions, one for the the condition difference-- I mean, the difference of reaction time effect within two conditions. So that's the model.
The next line, will you tell the program using the [INAUDIBLE] to tell the program reaction time is a quantitative variable? If you don't do that, the program will interpret the RT as categorical It would be considered as a factor. That would totally ruin the model, because those numbers are not factors.
So the next line, [INAUDIBLE] centers. So we need to tell the program what kind of a centering value. In this case, I already centered those values before I put it into the table. So that's why I put a 0 there to expressly say, don't do more centering. If you do, just center around 0. That's essentially equivalent to no centering.
The next line, the random effects, this case it's a little bit different. That's because reaction time is within subject. It's a repeated measure covariate. So tilt 1 plus RT. That means each subject has the unique reaction time across the two conditions.
So we want the both intercept-- random intercept and random slope. So if you're familiar with the linear mixed effects modeling, you know that what it means by the intercept and the random slope.
So the next line is we specify the sum of squares type. We typically use type three. Then we specify three [AUDIO OUT] post-hoc tests.
So for example, the first one, we want to compare the two conditions, house versus face. So we have the label, first house minus face, then we'll have the code, the symbolic coding. So they use the variable name first. Don't forget to put a space then colon space.
After that, we'll put the weights in front of a-- a weight for in front of each factor level. So that's what the 1 star house space minus 1 star face. So that's-- we want to compare the two factor levels.
So the next one is the face minus house. I mean, we don't have to do this, because the previous one is exactly the same. You just flip it at the location-- position of the two factor levels.
The last one, the third test is a little bit unique, because we are looking for the reaction time effect for-- I mean, reaction time effect difference between the two conditions. So we want to compare face versus house in terms of reaction time effect. So that's why the last part, Reaction Time, RT, space colon, that part, nothing after the colon, that means we just want the reaction time effect.
All right. So that's the script. Let's see if this works. I mean, you can try it on your computer. So type tcsh and the script. Probably it will complain because maybe I already have the-- yeah. So maybe we move the file.
Let's see. Now it should work. See, it loads up the data then shows us the contingency table. Now it starts to run. All right. So it takes a few seconds.
So this time, think the result-- maybe we can load up the data. So it's a 3 by 3 by 3. That's why it was quick. So it's only 27 voxels. So let me see, if I load the overlay-- all right.
So here it shows-- I don't know if on the screen-- for me it's on my laptop. it's difficult to see there. So [INAUDIBLE] four F-statistics. There are two main effects-- one for the condition, one for reaction time then the interaction. And also, there's a F-statistic for the intercept.
The intercept in this case, it's-- basically, it's when the reaction time is at the average, at the center value. Then the two conditions combined. Basically, that's what the intercept is about.
Suppose that we look at the reaction time effect. So we select that. Oh, actually maybe the post-hoc test is more interesting. House versus face, the contrast.
I mean, there are 27 voxels. There's not much to look at here. But basically, I only just use this to show what kind of output we have.
So one more thing I may want to show you is that you can always use the 3D info to check the file. Use verb. Then in this case, the output, should check the output file. So notice, once your 3D info, verbose then that file name, that on terminal will display the data.
So you can do it similarly on the AFNI GUI. You don't have to type on the terminal. For example, here, I believe there is a-- on the click Define Data Mode-- let's see, click Misc-- Miscellaneous.
Then there's a Overlay, Underlay Info. In this case, it's overlay. So if you-- oh, it is here, just at this small window. It is here. So it looks like there are multiple there hiding. [INAUDIBLE] it did multiple times.
So there, what I showed you on the terminal, it's also available on the GUI. So you don't have to type. So on the GUI you can follow the steps. So [INAUDIBLE] Define Data Mode, there is a Misc. There's a couple of others. The Misc, click Misc. In this case, it's the Overlay Info. Then you have this pop-up window. That basically shows exactly what I showed-- I mean, I did on the terminal. So same thing.
So the reason I want to show you this-- I mean, you can use this to extract multiple pieces of informations. For example, about the voxel size, here it is-- voxel size is 3 by 3 by 3. So here. So 3 millimeters. That's the voxel size. You can use that to extract. You can also look at what kind of a left and right-- what kind of a [INAUDIBLE] directions are AI or AOPI, for example.
Also, for each [INAUDIBLE] you can use this information panel to find out something. The reason I want to bring it up-- let's go to terminal, because it's easier to see. The reason I want to show you this, if you [INAUDIBLE] result, sometimes you want to-- you have t-statistic, F-statistic. You need not just to see the value itself-- you also need to report the degrees of freedom, right?
Sometimes, if you [INAUDIBLE] paper, people will put a t-value but they never will mention the degrees of freedom. Without knowing the degrees of freedom, the t-value itself is just meaningless. So that shows how sophisticated their knowledge is about the statistics.
So in this case, for example, the condition-- remember two conditions-- the main effect of condition F-statistic-- if you want the degrees of freedom, it's listed here, statistical parameters. So those two numbers-- for F-stats, we have two numbers for degrees of freedom. One is for the numerator, one is for the denominator.
So this F, it's essentially the t-stat, because the numerator degrees of freedom is 1. So essentially, it's a t-stat. But this, it is presented as F. So that's why there are two numbers for the degrees of freedom. All right. So that's-- down below, that's that. If it's a t-stat, you will see that degrees of freedom as well.
Another information here is that it shows the range. So in this case, see for the F-stats, it goes from 1 to 34.5-something. So it tells us about-- basically, gives us some summary about this data-- about this output data set.
Any questions? Another thing is, at the bottom of the 3D info, this 3D info basically extracts the header file of the output. So notice, at the bottom it shows the history-- specifically in this case, shows a copy of your script. That's very nice.
So even though you may-- for some reason, in last few, it was 3D [AUDIO OUT] script, but don't worry. It's saved in the output file in the header. You can basically use that to run it again or to modify it. So that's-- it keeps a record of--
AUDIENCE: With 3D copy, you keep the header or then it restarts. I remember sometimes it would [AUDIO OUT] an image, and the information would disappear.
GANG CHEN: What information? The history?
GANG CHEN: Really? Let's save--
AUDIENCE: [INAUDIBLE] 3D copy or [INAUDIBLE] a copy. But somehow, it [INAUDIBLE].
GANG CHEN: OK. Next time it happens, email us. I mean, email us what's-- exactly what's going on. So we need to know exactly which program is causing that trouble. So you're not so sure if it's a copy or something else.
AUDIENCE: No, not so sure.
GANG CHEN: OK. Tell us next time if that happens, OK?