Cell-Type Specific Transcriptomics
November 22, 2022
Sebastian Pineda, MIT
Tutorial on transcriptomic assays - TRAP and snRNA-seq sequencing with Sebastian Pineda
High-throughput sequencing assays have become ubiquitous and indispensable tools in molecular neurobiology. They provide a means to investigate gene expression, dissect gene interactions and pathways, and examine how these change in model systems and disease. Recent developments in transcriptomics have yielded methods that allow us to survey gene expression with cell-type specificity at varying degrees of resolution. In this tutorial, I will give a brief overview of two common cell-type specific transcriptomic assays: TRAP (translating ribosome affinity purification) and snRNA-seq (single-nucleus RNA sequencing). I will introduce the types of data that these assays generate and discuss the advantages and use cases of each. The bulk of the tutorial will focus on a typical data preprocessing and analysis workflow covering the topics of dimensionality reduction, count normalization, clustering, and cell type annotation, as well as key downstream analysis concepts such as differential gene expression and pathway enrichment. Lastly, I will showcase recently developed tools and software packages that facilitate more sophisticated methods such as batch correction, data imputation, and network construction to simplify and enhance large-scale transcriptomics data analysis.
PRESENTER 1: Sebastian today will be presenting on transcriptomic assays. And you work with Myriam Heiman, right?
SEBASTIAN PINEDA: Yes.
PRESENTER 1: Yes. What year are you in?
SEBASTIAN PINEDA: I'm a fifth year PhD student.
PRESENTER 1: OK, very cool. Yeah, thanks so much for agreeing to present today.
SEBASTIAN PINEDA: I'm Sebastian. I'm a fifth year PhD student in EECS. I work in Myriam Heiman's lab. And I'm also co-advised by Manolis Kellis at CSAIL. Mainly, we study neurodegeneration in human and mouse. And we apply a lot of omics technologies for this work, some of which we have developed, some of which we have adopted.
I'll talk about a few of those today. We work both on the experimental and computational aspects of this. So our lab does a lot of both methods assay development and also a lot of computational tool development for the analysis of these types of data.
So I want to talk about cell type specific transcriptomics, specifically focusing on RNA sequencing modalities that are used to target specific cell types and give us either cell level or cell type level resolution on gene expression, and talk about how we analyze that data, what we use it for.
So I don't know what everyone's level of familiarity here is with this-- I can run through this quickly, if most people are already familiar with the various omics technologies and how we use them. They're pretty ubiquitous these days in neurobiology.

Most labs study gene expression and use gene expression assays, especially when they're working on models of disease. So I'll just give a quick introduction here. There's different types of omics technologies used in biology that allow us to gain insight into different levels of abstraction.
So genomics is concerned with the study of the genome, specifically the genome sequence, its function, its variation. We have epigenomics, which focuses on the structure and modifications to the genome. The transcriptome, which-- and all, of course, are sort of in order of the flow of information according to the central dogma, right?
So transcriptomics focuses on the study of gene expression, specifically the study of RNAs. Proteomics studies protein function and protein interactions. And metabolomics studies the biochemical activity that those proteins permit.
So this is just a list of different technologies and the applications they're used for. The main one we're going to talk about is transcriptomics, specifically sequencing, which is a very popular, very high throughput way of studying the transcriptome. And what's nice about this is that you can do an assay where you study the cell biology at some level of abstraction that gives you insight into what's happening at another level.

So you don't have to measure things directly, right? We can do whole genome sequencing or we can do a SNP array and see that we have some deletion somewhere that we know is in an open reading frame. It's going to lead to some misfolded protein. And we don't have to do the proteomics assay to know that we're going to get these aberrant proteins.
So as we go down this ladder, we get closer to the things that drive our observed phenotype. Right? So if we're studying disease-- and we're talking about neurodegeneration here; our lab, for example, studies Huntington's and ALS-- Huntington's is monogenic, so you have a mutation in the huntingtin gene.

And that's what drives everything downstream of that. But if you have functional or genetic variation elsewhere, it might not actually mean anything, right? A lot of the genome either has no known function or has literally no function. Whereas if we go down this ladder, what's happening at the metabolomic level is the biochemical activity that directly drives our observable phenotype.

So it's inconvenient that as we go down this ladder, things get more difficult-- mainly in the context of sequencing technologies. As a general rule, sequencing nucleic acids is very, very easy. Profiling proteins and small molecules is very, very difficult.
And this poses a challenge for us, because at least in neurodegeneration, we would like to see what's happening at the protein and at the metabolite level. And we can't really do that with very high resolution, very high throughput. So the closest proxy we have is transcriptomics.
So we can look at gene expression changes. And that tells us what's happening at the protein level, at the functional level. This is not always the case. But it's the best proxy we have, right? So about 30 years ago, we obtained the ability to assess gene expression changes transcriptome-wide with the invention of the microarray.
And it's a very good technology. It's still used, to some extent. Although the drawback is that you need to know what you're looking for. You need to have an array where you get probes that capture specific transcripts. And if you don't know what those transcripts are, you're not going to capture them.
So when next generation sequencing came about-- which was several generations ago-- we now have the ability to do transcriptome-wide profiling without any sort of a priori knowledge. We can capture all the RNAs in a sample. We can construct sequencing libraries. We can align them. We can quantify them. We don't need to know anything about them.
And the general workflow is we take some samples, some piece of tissue. We lyse the cells. They spill out all their RNAs. We generate cDNA libraries-- we can sequence RNA directly, but it's not terribly efficient, so generally what we do is generate complementary DNA. We then fragment that.

We can sequence those fragments using very high resolution short read sequencing. And then we align these reads to the genome, which tells us what genes the reads map to and how many of them there are. So there are many different types of RNA sequencing methods that capture different things.

We can capture, for example, specifically polyadenylated transcripts, tRNAs, microRNAs, or only actively transcribed genes. So there's many different protocols. There's also many different sequencing approaches.
The most popular one is just doing very high fidelity short read single end or paired end sequencing, which we're going to talk about. But there's also ways of doing full length sequencing or base pair resolution sequencing. All of these have their advantages and disadvantages.
The more resolution you get at the nucleotide level, the lower your throughput is going to be. So the popular approach is just to sequence the ends of these transcripts, which you can do. You can sequence the whole transcriptome and get a quantification of what your gene expression profile looks like. But you sacrifice information, like splicing or sequence variations that might not be present at the 3 prime or the 5 prime end.
But this is very cheap and easy to do. And this is enough to tell us what-- at least in disease, the interesting question of what genes are changing, how they're changing, what changes are driving other changes, for example. So this picture I hope illustrates that the brain is a very complex organ, much more so than other organs.
If you look at, for example, the liver, you can take a little chunk of liver, and 99% of the cells in that little chunk of liver are going to be hepatocytes. This does not hold true for any part of the brain. The brain is extremely heterogeneous. Dozens of cell types at any given location. They're all very different. They all look very different.
And if you look at-- I mean, you can do a stain that just gives you an image of the outline of all the various neurons in a certain region of the cortex. And you're going to see many different shapes and sizes. But even the ones that look the same are still different. They might have different projections. They might have different developmental trajectories, different functions.
So this is not something that we can just grab a big chunk, extract all the RNA, and then get an accurate picture of what's happening in the brain, especially when we talk about diseases where only specific cell types are affected. We get misleading results or we get no results, because that cell type that we're looking at might be kind of rare.
So this is an analogy that I often see used, the smoothie versus the bowl of fruit. Bulk technologies, which is what we've been using since the mid '90s, basically give us a picture of the average gene expression changes across our sample or everything that we're profiling, whereas cell type specific methods allow us to dissect the various components out.
So if I were to give you a strawberry-blueberry smoothie in a gallon container and tell you that I put one grape in there, you probably wouldn't be able to taste that one grape. But that one grape might be very important. You might be allergic to grapes.
So an example of this is the cerebellum and spinocerebellar ataxia. The cerebellum is a very dense region-- the most dense region of the brain. It contains half of all the cells in the brain, about 50 billion of them. Most of them are these tiny little cells called granule cells.

And there's this family of diseases known as the cerebellar ataxias that famously deplete one very particular cell type called the Purkinje cells, which are very important because they have these massive dendritic arbors that connect to hundreds of thousands of other cells.

And these Purkinje cells make up less than 0.01% of all the cells in the cerebellum. So if I-- and people have done this-- if you profile the cerebellum at the bulk level, you're not going to see anything from this cell type. The signal's going to be completely drowned out. And even with very targeted methods, including the ones that we'll talk about, this is still very, very difficult.
So we're going to talk about two cell type specific sequencing modalities which are very common in the field. And they sort of encompass the two extremes of resolution. So one of them is TRAP, Translating Ribosome Affinity Purification, which allows us to target one very particular cell population with high accuracy and very high fidelity.
And another one, which you've probably heard of because it's become very popular in neurobiology, which is single nucleus RNA sequencing. And this one gives you a very global view of everything that's happening in your sample. And it can recover any sufficiently abundant cell type and give you cell level resolution.
So these are examples of breadth versus depth. One is very targeted, very, very high fidelity. The other one gives you a very broad profile and it's maybe not quite as accurate. And the best way to use these technologies is actually in a complementary fashion. If you can manage to combine the two, you can get a lot of very useful insight.
So before TRAP, the way we used to do any type of single cell omics is we had to dissociate and flow our samples. So we'd run them through something like a flow cytometer and we would basically put one-- we'd take like a 96-well or 384-well plate and we'd put one cell in each well.
And then we'd make a sequencing library out of each individual cell. And that was, of course, extremely expensive and extremely low throughput. But on top of that, you also needed to find a good molecular marker that could identify and isolate a very specific cell type. And that's very difficult to do. And it's still kind of difficult to do.
So TRAP was a game changer, because it was the first method back in 2008 that allowed for really accurate cell type specific targeting and isolating and sequencing the transcriptome of one very particularly well-defined cell population.
So the way TRAP works is you have some transgenic animal, like a mouse, and you introduce a transgene encoding an EGFP-L10a fusion protein, driven by a cell type specific promoter. So if you know, for example-- as in the original paper-- that we want to capture the D2 spiny projection neurons of the striatum, we know that that cell type more or less quite specifically expresses Drd2.

That's a dopamine receptor. And ideally, it's not expressed in any other cell type. So this transgene would be driven by that promoter. And only in that cell type do you get transcription and translation of this ribosomal large subunit protein that has a GFP fused to it.
So you would target these specific cell populations. And only those would have these GFP tagged ribosomes. And then you can lyse the cells and do an affinity purification. So you have these magnetic beads that have an EGFP antibody. And the beads would capture these GFP tagged ribosomes. You would remove everything else. And then you would pull the RNAs off of those ribosomes and sequence them.
And then you would make a cDNA library and sequence it. So what was nice about this is that it allowed for very high fidelity cell type specific RNA sequencing. And what you would selectively recover are these actively translated protein coding RNAs. So everything that you recovered is something that was going to be used to make protein.
And because all of these were GFP tagged, you could actually stick them under a microscope and you can validate beforehand that your experiment worked. You know, I'm targeting striatum. Only the striatum should light up. Right? You're not wasting time sequencing the wrong thing.
Furthermore, because you can actually introduce the transgene in many different ways, you're not restricted to just recovery of a specific cell population by molecular marker. You can, for example, put the transgene into a retroviral vector and ask what cells are projecting here.
And then, for example, we actually did this experiment-- a retrograde tracing-- where we injected the striatum. And we were able to label, by projection, cells that were projecting to the striatum. And then we can just dissect out that chunk of cortex. And now we have a cell type specific profile for cortical cells that make projections to the striatum.
So you can get pretty creative with this. When this came out, everybody was using it. And there are now many different variations you might have heard of. For example, RiboTag, which used not GFP but other tags that might have better specificity-- but you can't image them.
So just like the single nucleus RNA seq that we'll talk about, all of these, they get popular and people come up with 10 different ways to apply them. So when do we use TRAP? When do we not use TRAP? So as I mentioned, it's extremely high resolution, extremely high detection sensitivity. Very low noise, very low background, which is unique among transcriptomics technologies, because most of them are actually quite noisy.
We'll talk about an example of that. We can tag and recover cell populations by properties other than just cell type specific markers, which is unique, in a sense. We can image these samples and this also is a pro or a con, depending on how you look at it.
We get a very pure population of actively translated protein coding genes. If you're interested in things like long non-coding RNAs, microRNAs, then this assay might not work for you. But in our field, we're very interested in studying gene expression, and specifically things that encode the aberrant proteins that are causing our phenotype. So this is pretty ideal for us.
And this can actually be used for more than just transcriptomics. So we, for example, have used this for what we call a TRAP-ATAC, which we can recover chromatin accessibility profiles from DNA using these same labeled cell populations. So there's a lot of flexibility here.
So what are some cases where people would not want to use it, right? Because this requires a transgenic specimen, you can't use it in humans. And that's a big drawback, because we would love to get cell type specific transcriptomes from human post-mortem disease samples.
It's quite low throughput. As I mentioned, you can only target one cell type at a time. And if you don't know what cell type you're looking for, then this poses a challenge. Because then you either have to do many such experiments or-- well, there's better ways of approaching this.
If you are looking for cell types by a particular marker expression, this requires a priori knowledge of very high specificity markers. So in the case of these Drd2 neurons that I just talked about-- so Drd2 is actually expressed at a low level in astrocytes as well.
So without knowing that, you're going to capture a population of mostly Drd2 neurons, but with contamination from astroglial cells as well. And of course, because you're not recovering individual cells-- you're recovering the transcriptomes of all the cells that express this gene-- you're getting the average gene expression profile.
So if there is heterogeneity within your cell population, you're going to lose that information. And of course, you don't get anything from the non-coding transcriptome. All right, so on the other end of this is a technology that took off sometime in the last five years, single cell RNA sequencing. And there's many different ways of doing this.
And this in a sense predates technologies like TRAP, because you could just flow cells before that. Although as I mentioned, it was not an ideal experiment. So nowadays, a lot of these technologies rely on microfluidics.

So you dissociate a sample, you flow it through this channel that basically encapsulates individual cells in these droplets of water that contain the reagents for reverse transcription and amplification. And this allows you to recover basically one transcriptome per cell, as opposed to just the average transcriptome of one specific cell type.

So if you look in the neurodegeneration literature-- or just the neuro literature in general-- you'll see that people use the term single nucleus sequencing rather than single cell. So as you can imagine, that implies that we're isolating the nuclei rather than the cells themselves. So why is this the case?
So it is-- I'm not going to say difficult. I'm going to say impossible to cleanly dissociate a brain sample. Even if you grow neurons in culture, you cannot cleanly dissociate them. They'll break. And if they break, all the RNA is going to spill out. And you're going to get contamination from one cell type in another cell type.
And if it breaks, then you're going to get a lot of unhappy cell signatures. You're basically just going to sequence the transcriptome of something that's dying. So one way around that is single nucleus profiling, where we say, OK, we'll just homogenize the sample.
And by homogenize I mean we literally grind it to a pulp, lyse everything, stabilize all the proteins, and just recover the nuclei, usually by filtration or some sort of density gradient, where we separate the pellet of nuclei from everything that's not a nucleus. And then we take that out. And then we flow just those nuclei.

And so these nuclei are isolated. They're flowed through these devices that are basically a channel filled with oil. And you flow little water droplets through one channel, the nuclei through another one. The nuclei enter the water droplets. And inside each water droplet, there's also a bead coated in polyT oligos.

And so once the nucleus is lysed, all the polyadenylated transcripts bind to the bead. And then you can do the whole reverse transcription and cDNA generation in a way that-- sorry, I should mention that the polyT oligos are also barcoded, so that you know the transcripts all came from the same cell.
And then once they're barcoded, once they're reverse transcribed, you can pop all the little droplets, mix them together, and then you can just do a standard sequencing library construction. And you now have these barcoded single nucleus profiles.
So why do we not always do this? Well, we actually do nowadays. It's a very popular assay. And it's now our default assay for any pilot experiment, because we can use it on nearly any sample. We can use it in cell culture. We can do TRAP in cell culture too.
We can use it in human tissue. That's the big one. And it can be fresh. It can be frozen. It can even technically be fixed. So you don't need all the setup that goes into a transgenic animal to do this profiling. And unlike TRAP, this recovers pretty much everything-- everything that's sufficiently abundant.
You sample anywhere from a few hundred to 10,000, 20,000 cells. So you might still have difficulty picking up something like a Purkinje cell. But for most cell types, this isn't a problem. And since now you have individual cell transcriptomes, you preserve that heterogeneity within each cell population, right?
We know there's many different types of astrocytes. If we target, for example, GFAP expressing cells with TRAP, we're not going to see the difference between the protoplasmic and the interlaminar and the fibrous astrocytes. But here we will see that.

And we recover anything that's polyadenylated. And that's not restricted exclusively to protein coding genes. So we get a lot of antisense genes, pseudogenes, long non-coding RNAs. And you can also do this multimodally.
So from the same nuclei, you can also get DNA. So why do we not use it for everything? Why is it not a perfect substitute? Well, as I mentioned, assays like TRAP are very, very clean and they have very good detection sensitivity. The opposite is true for single cell and single nucleus assays.
They have pretty atrocious detection sensitivity. We recover only a fraction of the genes. And this isn't like a sequencing depth thing. Like, most of the transcriptome is literally just not captured at all. It's extremely noisy. It's very variable. All the droplets are subject to contamination.
And I'm going to go right out and say that there's a reproducibility crisis in the single cell field. Nobody can reproduce anything. Because we're using these barcoded transcripts, we lose splicing information. There are ways to do cell type specific full length sequencing.
Sorry, you can get full length transcripts from single cell assays. They're just much lower throughput. So we can't do it at the scale that we can with this scheme here. We also lose spatial information. With TRAP, we can image things. We can see where those cells are, where they exist in the brain or within a specific brain region. We don't have that information here.
And one big thing that's specifically a caveat of single nucleus versus single cell is that we lose the cytoplasmic transcriptome. For immune associated cell types, this is actually a really big deal. It's been shown now-- for example, in microglia-- that a lot of these key immune associated genes-- I mean, everything has to come from the nucleus originally.
But we just talked about how the detection sensitivity is not great. Most of those transcripts are concentrated in the cytoplasm. And they are completely lost. And of course, this doesn't really apply to brain much. But you can't use it in samples that have cells with no nuclei or multiple nuclei, because then you have no way of either capturing anything or keeping the nuclei that belong together, together.

So even though we're talking about brain here, even when we profile other organs, the single nucleus approach is often more popular, because it does deliver cleaner data. But even in cases where you can still cleanly dissociate whole cells, it's still not ideal. But sometimes you might have to, because the cytoplasmic component is very important or because you're working with blood or liver samples that have these anucleate or multinucleated cell populations.
OK, so what kind of data do these two assays deliver? So once we get our cDNAs in both cases, everything downstream of that is pretty much identical. The library construction, the way that the sequencing is done, the way that the data preprocessing and alignment is done. And we're not going to talk much about this. This is pretty standard across just RNA seq in general, how you do the alignment, how you do the feature quantification.
The output-- when somebody says, here's the data, please analyze it-- how it looks across both assays is effectively identical. The only difference is you get a matrix where your rows are the genes that you captured. And your columns are your samples. In the case of TRAP, one per sample, one per animal. Or in the case of single cell, the columns represent the individual cells.
But the format of the data itself is exactly the same. And once we're done curating the data at a later step, then they're actually identical. And I'll show you what that looks like. So a basic analysis workflow consists of basically three parts.
I don't have a slide for the third one. I'm just going to show you the third one. The preprocessing step doesn't really change between these two assays. You get the raw data. You QC it. You make sure that you can tell the A's and T's and G's and C's apart.
You then do an alignment-- well, you don't actually have to do an alignment. You do some sort of feature quantification that spits out this matrix telling you how many times we encountered each gene in each sample or cell. And you generally align to the whole transcriptome-- human, mouse, zebrafish, whatever.

And we're working on brain, and not every single gene is expressed in brain. So we can remove all the features that are not present, that we don't expect to find. And then we normalize the data, because there's going to be variation across batches, across samples.
The sequencers aren't perfect. Sometimes you get more reads in one sample versus another. Maybe you recovered a-- you usually don't recover the same number of cells in a single cell workflow. But you target the same number of reads, so some cells have more reads than others. So you have to account for all of this.
And once you have that, then this is where things start to diverge. Most of the QC for TRAP-- and really, any bulk RNA seq-- most of the QC is done at the preprocessing stage. If your reads are bad, if you have samples that are bad, you toss those.
But once you verify that your raw sequencing data is good and you've done all the basic pre-processing, from here on out, it's very, very simple. You can look for things like batch effects. You can look for things like outliers and maybe drop a sample or two.
But you can pretty much for the most part just go straight to the downstream analysis. The same is not true for single cell. Most of the QC, most of the data cleaning, actually happens after those steps. Once you have your counts, once your counts are filtered, then what you do is you generally visualize the data.
What you see is something very, very ugly. And instead of removing samples, generally we actually remove individual cells. Because I should say when you sequence these cells, they're not necessarily cells. You sequence droplets. And every droplet has a bead with some DNA, right?
And you sequence all of those. And most of them will be empty. The capture efficiency is very low-- about 1%, less than 1%. Less than 1% of your droplets actually have cells or nuclei in them. And this is by design, because you would rather have no nucleus in a droplet than more than one.

So it's a long and arduous process determining what is a real cell and what is not. And among the real cells, which ones are stressed? Which ones are damaged? We don't want those. Which ones are contaminated? Which ones have multiple nuclei?

And this is a long and interactive and iterative process. Whereas this same process for bulk and bulk-like methods is very straightforward, very fast. It also helps that TRAP already gives you very clean data, whereas single nucleus gives you dirty data to start with. So this part is very, very different.
So earlier this year, we taught a course on neurogenomics and computational molecular neuroscience. And we didn't really have assignments. What we had was a series of tutorials. We called them problem sets, but they were really tutorials-- they don't have any questions in them.

They cover a number of things, including how to analyze TRAP and snRNA-seq data. So you can actually go to that GitHub link and download these. And they're very extensive. They're basically reading assignments with interactive portions that tell you how to do things and give you all the necessary background and some real example data-- not toy data, but actual published data that you can play around with.
So feel free to download those. I'm going to use a slightly different-- I'm going to use these two specifically that I have here, the TRAP analysis and the single nucleus RNA seq analysis assignments. And I'm going to try to run them live. But some things might run quite slow. So I don't know if it will actually work.
If we're running low on time, I do just have the notebooks prerendered, so we can just go through them line by line. And we also have a Docker image that contains all the necessary tools and software you need to do this. So you ideally run the assignments in this Docker and everything will work.
And then we get to this. These are advanced methods. So our lab-- the Kellis lab in particular-- does a lot of method development, including tools for single cell RNA seq analysis. So when I mentioned recent developments or advanced techniques at the beginning, that's what I was talking about. And if we have time, we'll get into this. And this has more theory associated with it-- why we don't just do the vanilla analysis when it comes to this data.
OK. So I want to move over here to the notebook. As I mentioned, once we do all the alignment, all the quantification, all the QC-- which usually most people don't have to do themselves. If you send out a sample to be sequenced, they'll just give you this part, because it's very straightforward. Or when you buy a kit, they might recommend some tool that will do all this for you.
And basically, what you're going to be handed is in the case of TRAP data or any RNA seq data-- because ultimately, the TRAP data looks like any other RNA seq data-- is a matrix of counts that looks like this. So it's a table, some CSV file or some TSV file, where you have a column corresponding to each sample and a row corresponding to each gene.
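In code, loading a table like that might look something like this sketch (the file name is a hypothetical placeholder):

```r
# Read a counts table: genes as rows, one column per sample.
counts <- as.matrix(read.delim("trap_counts.tsv", row.names = 1))
counts[1:5, 1:4]  # peek at the first few genes and samples
```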
And each entry is the number of times that gene was encountered. Now, depending on how you generated your library, these could be unique counts, or they could be reads or fragments per kilobase per million (RPKM/FPKM), which means that you sequenced the whole transcript. Sorry? OK.
There's ways to do it such that you can make sure you only sequence or align each transcript once. And if you do that-- that's what you do in single cell-- you lose the splicing information. You don't have to do it here. You can do it here.
But if you sequence the whole thing, then you don't know if this fragment from exon 1 and this other fragment from exon 8 came from the same transcript. And if one gene is 10 times longer than another gene, then you have to scale that gene by 10. Otherwise, you're going to get 10 times as many reads-- or 1/10, rather.
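As a rough illustration of that length scaling, an FPKM-style calculation might look like the following sketch; `counts` and `gene_lengths` are hypothetical names, with lengths in base pairs in the same row order as the matrix:

```r
# Depth normalization (per million fragments), then length normalization
# (per kilobase of transcript).
fpkm <- function(counts, gene_lengths) {
  per_million <- t(t(counts) / colSums(counts)) * 1e6
  per_million / (gene_lengths / 1e3)
}
```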
So that's what the values here represent. So in this case, we have data from eight mice. This is TRAP data from D2 expressing spiny projection neurons. So they're Drd2-something-something. And we have four controls-- just C57BL/6 mice. Actually, are they? I think they're-- they're control mice.

And then four R6/2 mice. This is a mouse model of Huntington's disease, a pretty severe one. So we have four replicates for each. This is what our data looks like. And this is without doing any sort of data cleaning. This is just straight off the sequencer. It's been aligned. It's been quantified. This is what it looks like. These are the raw counts.
So the first thing we're going to do is we're going to remove the features that are not present, or the features that are so lowly expressed that we can't actually do anything with them. So I'm going to remove everything that has a variance of 0 and everything whose variance falls to the lower 10th percentile.
And then I'm going to show you how many genes we filter out. So we aligned to a transcriptome containing 21,300 genes. And we're left with 15,591. So this is just a sanity check. And now that we have that-- smaller data is always easier to work with, especially when we're doing matrix manipulations, because they're computationally expensive. Very expensive.
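The filtering step just described might be written like this minimal sketch, using the `counts` matrix from above:

```r
# Drop genes with zero variance or variance in the lower 10th percentile.
gene_var <- apply(counts, 1, var)
keep <- gene_var > 0 & gene_var > quantile(gene_var, 0.10)
counts <- counts[keep, ]
nrow(counts)  # e.g. from 21,300 genes down to 15,591
```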
So the fewer genes we have, usually, the easier our life is going to be. Now what we're going to do is make a little table of metadata that we can use for downstream analysis. And all this is going to do is-- we have all the necessary metadata in the column names here. I'm just going to split them and make another table that just has cell type, genotype, and replicate number.
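A sketch of that metadata step, assuming (hypothetically) sample names of the form "D2_CTRL_1" through "D2_R62_4":

```r
# Split the sample names into cell type, genotype, and replicate.
parts <- do.call(rbind, strsplit(colnames(counts), "_"))
meta <- data.frame(celltype  = parts[, 1],
                   genotype  = parts[, 2],
                   replicate = parts[, 3],
                   row.names = colnames(counts))
```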
It's going to look like that. So I have four controls and four R6/2 samples-- four replicates for each genotype. Very, very simple. And now what I'm going to do is-- how familiar are people with the concept of differential expression analysis?
So I should point out, the typical analysis workflow when we generate our RNA seq data is-- the two things that everybody does immediately-- you can do all sorts of fancy things. But the first things that everybody does immediately is they look at differential gene expression and pathway enrichment.
So differential gene expression is literally just control versus disease: are the disease genes up or down relative to the control? By how much? And we assign a p value for how much we believe they're actually changing.
There's another tutorial notebook on that same GitHub page that walks through in detail through how that works. You can do it manually, implement it yourself line by line, how to compute full changes, how to compute test statistics, all that stuff. So I'm not going to go through that.
But that's basically what we're doing here. And we're just setting up the data to use a package called DESeq, which is super popular, that just automates this whole process for us. We give it the data. We give it a table of metadata. We tell it exactly what our covariate of interest is.
In this case, it's the genotype that determines whether it's a healthy mouse or an HD mouse. And then it just spits out a table of genes telling us how much these genes have changed. So this is just a preprocessing step required here: we need to encode the genotype as a categorical variable (a factor), instead of just a character. It's not really very important.
And then we're going to normalize the matrix. There's many different ways of normalizing gene expression data. Since some samples are going to have more reads than other samples, we can scale by the library size. We just take the sum of all the gene expression values and divide them all by that, so that the sum for each sample is 1.
That's a very straightforward way of doing it. We can subsample such that every sample has the same number of reads or every cell has the same number of reads. That's reliable. But when you do that, of course, if you have a poorly sequenced sample or a cell with very few reads, there's going to be a huge loss of information everywhere else.
There's also fancier techniques, like quantile normalization that you can do, to make sure that distributions have the same anchor points. Here, we're just going to do-- this function that we declared earlier here is literally just doing library sized normalization. And then we're going to log scale.
Because when things change-- sorry, where was I? When genes change in expression, they don't change linearly. They change by orders of magnitude. So we log scale things. Otherwise, we get wild values that are not very intuitive.
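Those two normalization steps might look like this, continuing with the hypothetical `counts` matrix; the pseudocount of 1 avoids taking the log of zero:

```r
# Library-size normalization, then log transform.
norm_counts <- t(t(counts) / colSums(counts)) * 1e6  # scale each sample
log_counts  <- log2(norm_counts + 1)
```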
So I'm going to do that. And this is what our data now looks like. So we no longer have integer counts. We now have log scaled normalized counts. And then the first thing we're going to do here is we're going to just do a simple PCA. And that's going to tell us basically if we have any outliers, right?
So we have two components. The first two components are sufficient to describe the variance between our control and our HD genotype. So that's good. If the blue and the red were smeared together, we would be asking why and we would probably be repeating the experiment.
So knowing right off the bat that there are differences between them and that they separate by genotype is a good sign. This is just a simple but essential QC. We have one that's-- I'm not going to call it an outlier, but it ideally would cluster better with the other three.
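That QC plot could be produced with something like this sketch, using the `log_counts` and `meta` objects from above:

```r
# PCA for outlier/batch checking; prcomp expects samples as rows.
pca <- prcomp(t(log_counts), scale. = TRUE)
plot(pca$x[, 1], pca$x[, 2], pch = 19,
     col = ifelse(meta$genotype == "R62", "red", "blue"),
     xlab = "PC1", ylab = "PC2")
```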
So this one has a little bit of batch variance. But overall, this is a good trend. You could draw a line between them and separate the control and the R6/2 samples quite cleanly. Now what we're going to do is feed in our counts and our metadata, and create an object that this software package will take to automate the differential gene expression analysis for us.
And then we're going to tell it what variable do we want to compare. In this case, we have only one variable. That variable is our genotype. And we want to know the difference between R62 and control. You can give it other covariates. So if you have human data, you would want to give it things like sex, age, anything else.
If you're looking at something like Alzheimer's or Parkinson's disease, you want to give it something like Braak stage, because what you want to do is isolate the effect of your covariate of interest. You're going to see effects. But those effects might be a combination of multiple covariates. We want to regress out everything that is not driven by case versus control.

So we're going to put it into a convenient object that basically combines all of this in a way that we can just run one line to run the algorithm. And what you'll notice here is we didn't give it the normalized counts. Depending on what package you use for differential expression analysis, you can give it the normalized counts or counts with weights.
This is a very simple method. This is why people like it, because you just give it the raw data. You give it the covariates. You tell it what each column corresponds to. And it'll do everything on its own. It'll estimate the library sizes. It'll estimate the dispersion, which is the relationship between the mean and the variance.
It'll do all the model fitting. It takes care of everything for you. And then it gives you a list of results. We only gave it one variable, so our only result is the differential expression for the genotype, R6/2 versus control.
And what we're going to do real quick is we're just going to pull it out, because it's in a weird format. And we're just going to turn it into a table that we can read. Shouldn't take that long. And that's it. And this is our output.
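A condensed sketch of that whole step, written here with DESeq2 (the current incarnation of the package) and the hypothetical `counts` and `meta` objects from above-- note it gets the raw, unnormalized counts:

```r
library(DESeq2)
# Encode genotype as a factor, build the DESeq2 object, and run the pipeline
# (library size estimation, dispersion estimation, model fitting).
meta$genotype <- factor(meta$genotype, levels = c("CTRL", "R62"))
dds <- DESeqDataSetFromMatrix(countData = counts, colData = meta,
                              design = ~ genotype)
dds <- DESeq(dds)
res <- results(dds, contrast = c("genotype", "R62", "CTRL"))
res_df <- as.data.frame(res)  # baseMean, log2FoldChange, lfcSE, stat, pvalue, padj
head(res_df[order(res_df$pvalue), ])
```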
We now have a table of genes. This is the mean expression across our samples. This is the fold change-- this is what we care about, the magnitude of our effect size-- then the standard error of the fold change, the Wald statistic, and then our p values, which are also important.
So we mainly care about three columns-- gene, fold change, p value. That tells us everything that we need to know, for the most part. And what we can do is now sort by effect size. And we can see what are the genes that are changing the most. Are these genes interesting? What are they doing?
So that's kind of it. We didn't have to do much preprocessing once we got the raw data. The data was already very clean. We could condense all this to maybe 5 to 10 lines of code. And that's our results table. You'd make a nice little, I don't know, bar chart or something, throw this in an Excel file with supplementary data. It's ready to publish.
So how much can we trust these results now, is the next question. So let's pull up the columns that we care about. Let's sort by-- what am I sorting here? I'm sorting by p value. OK. So these are our top hits by p value. And right off the bat, if you're familiar with the HD literature, you know that these genes are the canonical differentially expressed genes for spiny projection neurons in the striatum.

In HD, these genes are dramatically depleted. And we can see that right off the bat. So this is a good sanity check that our data is correct, it's reliable. And now we can sort by fold change and ask, what are the things that are changing the most? Because now, these are genes that we haven't necessarily seen before.
OK, so we go back to this thing. This seemed like it was very easy to do. It took us all of five, 10 minutes to do it. And now we have a list of genes. What do we do with these genes? How much can we trust these genes? We look at the top hit.
The thing that's changing the most is this gene called Gpx6, which at the time we didn't know anything about it. Is it really that simple? Is it so simple that you can just get the raw data, run 10 lines of code, you have your results, and that's it, you have a hit? As it turns out, that is-- where are we?
Did I lose it? No? All right, one second. It turns out it really is that simple. So from that data, this is the data that we shared with the class. We didn't do anything to it. This was just the raw data as we received it.
It's pretty much exactly the way that we analyzed it. And just doing this incredibly simple analysis, we got a hit. It's that simple. And turns out that if you restore the expression of this one gene, that turns out to be the most differentially expressed one in this mouse model, the mouse gets better.
So this type of data is actually very simple to work with. And because it's already so clean right off the bat, we have to do very, very little to start off. Let's see, sorry. This is going back to the wrong one. Where are we? I think I lost-- oh, there it is.
Sorry, I have two windows open, because I wanted to preload the single cell data analysis part. So the rest of this-- which is in the GitHub-- just shows how you would repeat this process with a different software package. It's going to give you the same results.
But essentially, it really is that simple. Once you get your hands on the count matrix, you do some very minimal manipulations, you remove the excess genes, you normalize your counts, you stick them into a tool like this.
And even doing it by hand-- computing the statistics, computing the effect sizes, the p values, is something that can be done very, very quickly very, very easily. So this is why RNA seq is so popular, because it's not a steep learning curve to start using it and it's quite reliable.
So the caveat is that TRAP itself is not all that simple. The experimental component does require a bit of a steep learning curve. And that's why we don't use it for everything. But once you do the experiment, once you get your hands on the data, it's super easy to deal with after that. And this same workflow you can use for any bulk RNA seq data or any data that's in a similar format.
So we can take a look at more advanced topics. One is we have a data set here-- so is it already loaded? Probably not loaded. I'm just going to load it real quick. We generated TRAP data from an allelic series.

So Huntington's disease is caused by a mutation in the huntingtin gene. There's this repeat: CAG, CAG, CAG. Each CAG encodes a glutamine (Q) in the protein. And most people have about 20. And when that repeat is expanded, you have Huntington's disease.

And the longer this repeat is, the more severe the phenotype and the earlier the onset. So we have a mouse model that is an allelic series here. Basically, we have a control mouse that has 20 Q's. And then we have these expanded alleles in these various mouse lines.
So we have mice with 50 Q's, 111 Q's, 175 Q's, and we do TRAP on all of these. So we can do a different type of analysis with this data. So what happens when we plot that? When we do our PCA plot, we have more replicates here. We have a different covariate that isn't quite as black and white.
And right off the bat, you can see, OK, I have two outliers here in this data. Things for the most part separate well. So you can see they're in order: 20, 50, 111, 175. OK, it's not too bad. But we have this one here that we don't like. We have this one here that we really don't like.
Do we necessarily have to toss these samples? Not really. So if we know what's driving these, if we know what external variable is responsible for this drastic difference in gene expression, we can regress that out quite easily.
But let's say we don't know what that is. There's a couple of tools for removing unwanted variance, or identifying variables that drive our gene expression that we don't want or we don't care about. One of them is a very popular one that's called surrogate variable analysis.
And there's a package called SVA that does this for you. And basically, you tell it the covariate you care about and you say, OK, I want my gene expression to be driven purely by the Q length. I don't care about anything else. Remove all the effects that are not driven by Q length.
So we do that and it identifies covariates that are not Q length but are responsible for our variation. And here, it found 14. That's a lot. So now that we have those 14 unknown covariates, we can remove them.

So all this function does is regress them out. And now we can do our PCA again. And now it looks beautiful. We no longer have outliers. And what we do see is that the 20 and 50 are still close together. It's actually known that the phenotypic threshold in mouse is somewhere between 50 and 111. So we kind of expect this.
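A sketch of that SVA step, assuming (hypothetically) that the allelic series data is again in a `log_counts` matrix with a `meta$qlen` column holding each sample's repeat length:

```r
library(sva)
library(limma)
# Full model contains our covariate of interest; null model is intercept only.
mod  <- model.matrix(~ qlen, data = meta)
mod0 <- model.matrix(~ 1, data = meta)
svobj <- sva(log_counts, mod, mod0)  # estimates the surrogate variables
# Regress the surrogate variables out to get corrected counts.
corrected <- removeBatchEffect(log_counts, covariates = svobj$sv, design = mod)
```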
So now the data is a lot cleaner. We can take our corrected counts. Our counts now look like-- let's see, can I load them? They now look like this. They're different than the original counts. But now we can analyze these corrected counts just the way we did before. And we don't have to worry about the batch effect that we had.
If the data is really dirty, if there's a true outlier somewhere in there that's driven by something very extreme, this doesn't work. But for the most part, this is-- if your data just looks a little bit off, these are the kinds of techniques that you can use to salvage it.
The next thing I want to talk about is gene set enrichment analysis. Now that we have a list of differentially expressed genes, we can ask, OK, what do these genes have in common? Are they part of the same pathway, the same process? Are they associated with the same disease? What cellular processes are these genes part of?
And what we basically do in essence is we compare against the database of genes. So somebody's curated a database where we have genes associated with Huntington's, genes associated with Wnt signaling, genes associated with oxidative phosphorylation. And we can do a simple enrichment test and say, are my dysregulated genes overrepresented in any of these gene sets?
So here, I'm just going to load the R6/2 D2 TRAP data again. And then I'm going to load a pathway database from KEGG, the Kyoto Encyclopedia of Genes and Genomes-- I think that's what it's called.

And all this basically is is a list of pathways. And for each pathway, we just have a list of genes that are associated with it. So we have Wnt signaling, and here's five genes that are associated with Wnt signaling. Here's five or six genes that are associated with ALS.

And now we do a simple hypergeometric test. And the end result is a p value that tells us how enriched our differentially expressed genes are in these pathways. So if we sort by p value, all our top hits are these different types of synapses, things like axon guidance, things like circadian entrainment. Mostly synaptic terms.
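In essence, the test for one pathway is a short hypergeometric calculation; `de_genes`, `pathway_genes`, and `universe` are hypothetical vectors of our differentially expressed genes, one gene set, and all tested genes:

```r
# P(observing at least this much overlap by chance).
overlap <- length(intersect(de_genes, pathway_genes))
p <- phyper(overlap - 1,                               # at least `overlap` hits
            length(pathway_genes),                     # "successes" in the urn
            length(universe) - length(pathway_genes),  # "failures" in the urn
            length(de_genes),                          # number of draws
            lower.tail = FALSE)
```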
So we know that this is accurate. This has been validated before. So that's just a manual way of doing it. There's a lot of fancy packages you can use. For example, g:Profiler, which allows you to query many, many different databases and compare against different organisms.
You can tell it to return only the statistically significant ones with this threshold, correct your p values for multiple hypothesis testing. So all sorts of fancy things. We're just going to try the KEGG pathway again. This time, we don't have to do it manually. Just going to query the most recent version of the database.
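With the gprofiler2 R client, that query might look like this; `de_genes` is again a hypothetical vector of significant mouse gene symbols:

```r
library(gprofiler2)
# Query g:Profiler for KEGG enrichment, keeping only significant terms
# (multiple-testing correction is applied by the service).
gres <- gost(query = de_genes, organism = "mmusculus",
             sources = "KEGG", significant = TRUE)
head(gres$result[, c("term_name", "p_value")])
```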
And the end result is pretty much the same. So there's many ways to do things manually. But if somebody has already developed a tool to automate it, it saves you a lot of time. So this is the most naive way of doing it.
There's some fancier methods like GSEA, which do all sorts of fancy things to determine pathway enrichment. Because these very naive methods are prone to a lot of false positives. So you see things here that don't really make sense-- like morphine addiction, cocaine addiction. That's not really relevant to Huntington's disease.
And it might be just because this naive method doesn't have a way of distinguishing things that are relevant versus irrelevant. So that's just a very basic-- not even necessarily TRAP oriented, but just RNA seq workflow.
And if you look at a paper that has done RNA seq, these are at least the two things that everybody does. It's very straightforward, very easy. And when you do an experiment like this, that's really most of the time what you need to do. You get a list of genes. And those genes are your hits. And those are the genes you can go investigate.
You can target them, knock them down, overexpress them. RNA seq is not really a terminal experiment. It's not really the thing that you're going to publish, although people do. It gives you leads on what to pursue, right? What you get is these large tables of results that are basically ranked in order of importance.
Well, this has the biggest change, has the smallest p value. Maybe you should go after this gene first. So RNA seq is really an exploratory tool more than anything. But it's also something that's very easy, very accessible.
So now we can ask, OK, this is what's-- we can now ask, how do we do the same thing, but with cell type specificity? What additional information can we get if we were to do this with single cell resolution? So we'll go to that.
All right, this one I kind of pre-ran, because it runs kind of slowly. So a little bit of background. I guess I talked about this. This is how we barcode our transcripts. So the end result is that we get a matrix that only has count information.

We don't know anything about splice variants. We don't know anything about sequence variation in our transcripts. We usually read between 20 and 100 bases from each end of a transcript. And that's enough to uniquely map to any feature of the genome.
So what we end up with is just the table of genes by cells and how many times that gene was found in that cell. Now, there's four steps before we can get to the downstream analysis. The first is the alignment-- we've already done that-- the alignment and feature quantification.
The next things are cleaning the data, clustering the data, and annotating the data. What does this mean? Cleaning the data, as I mentioned, single cell data is very, very dirty. And most of our signal is not going to be biologically driven. It might be technical variance.
A lot of the biological signal is going to include cells that are damaged, cells that are contaminated. So this stuff has to be removed. Unfortunately, people will say, oh, there are these universal metrics-- like mitochondrial RNA contamination-- or all these tools for finding doublets.

The truth is that it's very experiment specific. Right? So I can say, oh, let me get rid of all these cells that have a lot of mitochondrial gene expression, because there shouldn't be any mitochondrial gene expression-- the mitochondrial RNA should be sequestered to the mitochondria.

But it's in my sample. That means the cells probably lysed. It's not really a good cell to profile. The truth is that if you're studying Huntington's, if you're studying ALS, mitochondrial RNA release is actually a disease phenotype. The released RNA then binds to the ribosomes that are attached to the nucleus, which you pull down. So it's not that simple.

Or you have a droplet that looks like it came from two different cell types. You want to call it a doublet and remove it. But the truth is, you might have some cell that is in the middle of some developmental trajectory, in a transient stage where it has both, I don't know, oligodendrocyte and OPC marker gene expression.
So it really depends on what you're profiling. And no two dissimilar data sets are going to be cleaned the same way. So it's hard for me to give a tutorial on how to do this part. Because I can show you something that might not apply to you in any way, shape, or form if you're looking at something else.
So what I'm going to do is a little bit of a disservice. But I'm going to pretend that our data is already clean. And this is unfortunate, because the data cleaning part is actually the most time consuming part. But if I were to show you that, we would take up the rest of the whole hour.
So we're going to go straight to the next part, which is clustering and annotation. So since we're profiling and recovering many different cell types, what we're going to end up with is clusters of cells, each in theory representing a different cell type. Cells within a cluster are going to have similar gene expression patterns, and those patterns are going to be dissimilar across clusters.
And once we identify them-- once we identify the actual groupings of the cells, we want to know what cell type is present or represented by each of these clusters. That's the annotation part. So you need some sort of prior knowledge on the brain region or the organ you're looking at to know what cell types to expect and how to label them.
So we're just going to assume that we have good, clean data to start with. And I'm going to show you what that looks like. So I mentioned that the format is exactly the same. But because the detection is low and we're sampling so many cells, if we were to just encode it as a dense matrix like we had before, it would be a very, very large matrix.

And people do that. They'll upload these 10 gigabyte CSV files with their data. And it's very difficult to work with. And of course, most of it is empty, right? So the common way we encode single cell data is in sparse format, where we keep only the non-zero values of our data.

So usually, you have a matrix that is in MTX format. And the MTX format is basically just three columns-- the i index (the gene), the j index (the cell), and then the value. And you only keep the non-zero values.
And then you have another table that tells you what the columns represent. Other table tells you what the rows represent. And you can put them all together. And that gives you your gene expression matrix. So that's what it looks like here. I think it's loaded. Yeah.
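Reading that three-file sparse format might look like the following sketch (hypothetical file names, following the common CellRanger-style layout):

```r
library(Matrix)
mat <- readMM("matrix.mtx")                 # non-zero (gene, cell, count) triplets
rownames(mat) <- readLines("features.tsv")  # what the rows represent: genes
colnames(mat) <- readLines("barcodes.tsv")  # what the columns represent: cells
```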
If I try to open it, like I showed you the other one, it might crash. So I'm not going to do that. But it looks the same as the TRAP, but just the columns are cells. And we're going to do the same thing. We're going to start doing the same thing.
We're going to filter things that are not present, because not only are we looking at a part of the brain that doesn't express every single gene, but also our detection sensitivity is a lot lower. So we're going to capture far fewer genes.
And right now-- so what I'm doing is I'm removing genes that were not detected at all. They have zero counts across the board. Or because I also only want to deal with genes that are sufficiently abundant-- because those are the only ones that can be analyzed-- I'm going to remove everything that isn't present in at least 1% of cells.
Why 1%? Because you have very rare cell types. In this brain region, for example, we have T cells. They make up less than 1%. And they're important. We don't want to remove them. We don't want to remove all the important genes that are responsible for T cell identity, because then we won't be able to identify those cells.
So once we do that, we filter out about half our genes. We filter out a lot more than we did in TRAP. And now we do the exact same thing. We normalize our data the same way. We scale the columns. We log transform them.
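A minimal sketch of those two steps, assuming the sparse counts matrix from before (genes as rows, cells as columns):

    library(Matrix)

    # Keep genes detected in at least 1% of cells (this drops all-zero genes too)
    detected <- rowSums(counts > 0)
    counts   <- counts[detected >= 0.01 * ncol(counts), ]

    # Library-size normalization per cell (column), then log transform
    libsize <- colSums(counts)
    lognorm <- log1p(counts %*% Diagonal(x = median(libsize) / libsize))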
I think we already did this part. I don't have to run it again. And now we're going to start doing the same thing we did with the TRAP data. So I'm just going to plot PCA. I'm just going to plot the first two components. And we'll see what we get. Oops.
It didn't actually read in all of these. All right. OK. I tried running this beforehand, but I don't think it actually ran all the way through. So I'll just do that. This will take just a second to reload the data.
Again, we need to make our table of metadata, so that all these algorithms know what they're dealing with. I'm just going to show you what that looks like. In this case, the metadata is actually by cell. So it's not even going to show it. It's really big.
So we have the same table, but it's one row for each cell versus one for each sample. OK, let's re-normalize the data, re-filter it. We do the same thing again. OK. So we have more than eight points this time. Specifically, we have 6,790 cells.
These are from one mouse and 14,000 genes that we've reduced to 50 dimensions by PCA. And we plotted them out. And now I already cleaned this data. I already annotated this data. I know what's in it. So I'm trying to convince you here-- if you look at the legend-- that there are 16 distinct, well-defined cell types here.
Now, I color coded them, so you can kind of see most of them, or some of them, at least. But if they were not color coded, if I just showed you everything-- oh, and I think these are all controls. There's just one healthy C57BL/6 mouse here.
If I were to just show you this plot, you wouldn't be able to tell that there were 16 cell types. You would not believe me. There's maybe six or seven badly smeared clusters. One here, one here, one here, one here, and then something's going on here.
So this is what people used to do, if you look at the very early single cell papers. And if I were to tell you, oh, there's 45 distinct cell populations in the human motor cortex or the mouse motor cortex, you wouldn't-- this is striatum. There's maybe like 30 something.
You wouldn't be able to get that information from doing this naive analysis. So about 15 years ago, people developed tools that were very good at dealing with this type of data. The first big one was called t-SNE, which people used in single cell quite extensively up until recently.
And what this does is it takes very, very high dimensional data and embeds it into 2D or 3D space for visualization. These are dimensionality reduction methods, like PCA, but they're specifically optimized for the visualization of high dimensional data.
So we don't use t-SNE much anymore, because it turns out that there are even better algorithms for it. The one that's very popular now, fairly ubiquitous, is called UMAP. So I'll use that one as an example. And this will give us a much better two dimensional projection of our 14,000 dimensional data.
Now, although UMAP itself does the dimensionality reduction, it works better if you give it the PCA already, because then it has fewer features to worry about optimizing. So we're going to do that. We're going to give it the PCA, the top 50 components. And we're going to let it do its thing.
And by default, UMAP is going to try to arrange our points in 2D space. It's going to try to cluster our data, basically making efficient use of space to cluster our data based on how similar or dissimilar points are, in a way that preserves much more information than PCA does.
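A minimal sketch of that PCA-then-UMAP step with the irlba and uwot packages, assuming the log-normalized matrix from before; the neighbor setting is uwot's default, not necessarily the tutorial's exact parameters.

    library(irlba)   # fast truncated PCA
    library(uwot)    # R implementation of UMAP

    # Top 50 principal components of the cells (cells x 50 score matrix)
    pca <- prcomp_irlba(t(lognorm), n = 50)

    # UMAP builds its own kNN graph (15 neighbors by default) on the PCA scores
    emb <- umap(pca$x, n_neighbors = 15)
    plot(emb, pch = 16, cex = 0.3, xlab = "UMAP1", ylab = "UMAP2")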
So hopefully this runs quickly. And if not, that's fine. OK, this looks very, very different right off the bat, as you can see. And here you can see closer to the 16 cell populations that I told you about. And you can also see structure.
You can see, oh, these two look kind of similar. I wonder what they are. Spoiler, these are D1 SPNs and these are D2 SPNs. They're known to be very similar. They should have very similar structure. PCA wasn't going to tell you that, though, right?
And now we see that there's these clusters of cells. We can assume that each one is a different cell type. So each unique cluster corresponds to a specific cell type. And now what we can do is we can take this plot, this projection, and just superimpose genes on it.
So we just make a heatmap of four canonical markers: Drd2, which is in D2 SPNs; CLU, in astrocytes; MBP, in oligodendrocytes; and ADARB2, which is in this interneuron population in the striatum. And we can see what that looks like.
So we see that one cluster and only one cluster is enriched for Drd2. One cluster is enriched for CLU. One is enriched for MBP. Right? So this is the direct way of doing cell type annotation. We take our known markers-- and this requires that we know what the markers are. It requires that we know what the cell types are.
Sorry, is there something in the chat that I missed? What kind of clustering did you use to cluster the data to 16 components? I'll get into that. We used a tool that our lab developed that I'll go into more detail later on.
Yeah. You need to know what your markers are. And you need to know what cell types are present. And you don't always know these things, right? Especially when you're trying to find de novo markers or you're trying to identify new cell types, this can be a challenge.
But this is the naive approach. And here we can start assigning identities to each blob of cells. Now, what do we want to get out of clustering the data? We want the data to tell us how many cell types to expect.
Because I don't know how many cell types there are. So UMAP and similar algorithms already do something where they create a graph. The way that UMAP groups points is it asks, for each point, what is the Euclidean distance to every other point?
And then I don't know, I think the default implementation-- it asks for each cell or each data point, what are the 15 closest other data points? And then it makes a graph. And then that graph is-- I'm probably getting too much into the theory behind this.
But as you can see, there's a lot of text in here. And you can read through it. And it goes into all this in detail. And it's got links to various resources, if you want to learn more about each of these specific components.
And all of this is public. It's on the GitHub. All the data I'm using is also public. And all these notebooks should download them automatically when you run them. So where were we? We were here.
So it generates a graph. And then it uses that graph to optimize the visualization. So when we have a graph-- and it's very popular to do these graph theoretic approaches when analyzing single cell data, because you can treat the data as a graph.
And that opens a whole world of possibilities in terms of data analysis, that you can encode your data in this format. So we do. One of those things is that you can apply all sorts of clustering algorithms. So in this graph, where are the hubs? Where are the neighborhoods of activity or the neighborhoods of information present?
And we can apply algorithms, like Leiden, which are very popular, which assign-- oops. Apparently I didn't run this. They assign membership. First it tells you, OK, in your data, there's 10 groups of informative things.
And then it assigns every point in your data to one of those groups. And this is what it looks like. So it identified 14 clusters, some of which match our cell types perfectly. And mind you, the visualization has nothing to do with the underlying structure of the data.
The clustering algorithm doesn't care what your plot looks like. You can run it on the same data that went into the PCA plot. And as long as the underlying graph structure is the same, it'll give you the same results.
But you can see that they match quite well, for the most part. So I identified 14 cell types. Some of them are overresolved. Some of them are underresolved. But it's pretty good right off the bat. And we'll come back to this and talk about more advanced ways of doing this.
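For illustration, here is one way to do that graph construction and Leiden step by hand in R; the k and resolution values are arbitrary choices for the sketch, not the tutorial's.

    library(FNN)
    library(igraph)

    # 15-nearest-neighbor graph on the PCA scores
    knn   <- get.knn(pca$x, k = 15)$nn.index
    edges <- cbind(rep(seq_len(nrow(knn)), ncol(knn)), as.vector(knn))
    g     <- simplify(graph_from_edgelist(edges, directed = FALSE))

    # Leiden community detection assigns every cell to a cluster
    cl <- cluster_leiden(g, objective_function = "modularity",
                         resolution_parameter = 1)
    table(membership(cl))   # cluster sizes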
But right now, we've just been playing around with the raw data. We're handing matrices and tables to all these algorithms. We can do what we did in the TRAP case, where there's these convenient software packages that you hand all the data to and they do everything.
And so we're going to create an object called a single cell experiment object. There's many different types. If you're familiar with Python or Python tools, a very popular one there is called AnnData. And these are the ones that you often see cited in the literature. People use these packages because they just make life easier.
So we're going to give it the count matrix. We're going to give it the table that tells the object what cell each column corresponds to, what gene each row corresponds to. And it just makes life easier to work with.
So now we have an object. These are the dimensions, 14,000 genes, 6,700 cells. We have our log normalized counts. Our rows are genes. And then we can add other metadata, like this gene has this Ensembl ID, it belongs to this chromosome, it's protein coding or it's long non-coding, whatever.
And we can say, oh, this cell has this barcode. And once you annotate your cell types, you can say, all these cells belong to this cell type or belong to this genotype, or whatever. So this just makes life easier. And then we can also add reduced representations.
So we can add our PCA matrix. We can add our-- yeah, our PCA matrix. We can add our UMAP visualization to it. And this allows us to deal with the data very, very easily. And now we can say, OK, here's my object.
Let me just pull out everything that's in cluster 1 and visualize it. And that's it, right? So this is what was here. This is cluster 1. Now we're pulling it out. And now we can-- it allows us to easily manipulate the data and do all sorts of fancy analysis.
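Roughly, the construction and the cluster-1 subset look like this; a sketch, with placeholder metadata columns.

    library(SingleCellExperiment)

    sce <- SingleCellExperiment(
      assays  = list(counts = counts, logcounts = lognorm),
      rowData = DataFrame(symbol = rownames(counts)),   # per-gene metadata
      colData = DataFrame(barcode = colnames(counts))   # per-cell metadata
    )
    reducedDim(sce, "PCA")  <- pca$x
    reducedDim(sce, "UMAP") <- emb
    sce$cluster <- membership(cl)

    # Pulling out everything in cluster 1 is now a one-liner
    sce_c1 <- sce[, sce$cluster == 1]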
Now I have a clustering. So before, I showed you that these genes are expressed in these specific clusters. But this is by eye, right? These cells have not been grouped thus far. But now I did a Leiden clustering, which tells me there are about 14 different groups of cells.
And I want to ask, OK, what does my gene enrichment look like within these cells? And I can have here-- so these are five canonical oligodendrocyte marker genes. And I can ask, in what cluster are they present? And turns out, they're all highly expressed in this cluster 3.
So I can go back to this plot. And this is my cluster 3. And I can say with confidence, OK, everything here is an oligodendrocyte, most likely. So I'll annotate my cells accordingly. And then I can also do something very neat, which is de novo marker gene detection.
So sometimes we know what the markers are. Sometimes we don't know what they are and we would like to know what they are. Or sometimes we know what a few markers are but we'd like to find more of them. So this-- again, in that GitHub repository, there's a tutorial on how to do this manually.
This is just running a Wilcoxon rank sum test, which is asking, OK, I'm telling you that I have 14 groupings of cells. Tell me what genes are specific to each cluster, right? Which genes are differentially expressed between clusters.
Or rather, in this case, what's enriched in cluster 3 versus everything else? And this is the naive but also very reliable way of identifying marker genes. And if we do that and pull out the results for cluster 3-- the one enriched for those five canonical oligodendrocyte marker genes-- we see all the ones we found before, all the ones we knew about, as well as a bunch of new ones.
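A naive, slow, but transparent sketch of that one-vs-rest test for cluster 3:

    # Wilcoxon rank-sum test per gene: cluster 3 versus everything else
    in_cl <- sce$cluster == 3
    expr  <- as.matrix(logcounts(sce))
    pvals <- apply(expr, 1, function(x)
      wilcox.test(x[in_cl], x[!in_cl], alternative = "greater")$p.value)

    # Multiple-testing correction, then rank: top genes are putative markers
    head(sort(p.adjust(pvals, method = "BH")), 10)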
And I'll show you how we can validate that all these other novel markers are actually accurate. So we'll go back to doing how you do differential expression analysis in single cell data and stuff in a bit. But let's talk about an advanced approach.
So these are the tools that our lab has developed. Let me stop sharing that, go back to sharing the PowerPoint slide, move this out of the way. OK. So our lab is working on a tool that we've now used for a number of studies for doing cell type identification and clustering analysis and just all sorts of things.
So before what we've been doing is we've just been reducing our data with PCA and then using one of these fancy visualization algorithms to optimize-- well, 2D visualization of the data, and then doing manual annotation. And I told you there are 16 cell types.
We still haven't recovered all of them. Some of them are rare. Some of them are difficult to find. So how can we find them? How can we find novel marker genes? How can we-- oh, I should also point out that because the data does not have good detection sensitivity, it's quite sparse, we're missing a lot of information, right?
So we're missing a lot of genes. If I show you the gene expression maps again, where I showed you the four markers in the four clusters, you noticed they didn't highlight every single cell there. But they should have. So how do we deal with this data sparsity, which can be detrimental?
So I mean, I can go into the detail. But I'm just going to-- I think we can put the slide up. I can probably put all these slides in the GitHub, because there's papers associated with them. But basically, instead of doing just a naive reduction and then plotting principal components or using them to plot UMAP, what I can do is use the-- well, let's put it this way.
When we assign cells membership to a certain identity, we ask, OK, what does the average gene expression profile look like for this cell type for that cell type? And is this cell most similar to that or is it most similar to some other reference expression profile?
And so all of these methods, what they do is they look at the average profile of each cluster, and then do things like annotation based on that. Well, when we want to find very rare cell types, this doesn't work very well.
So there's this idea that in biology, we have a lot of redundancy. But also there are cases where we don't expect to see redundancy. And in cell type specialization, that is one of those places where we don't see as much redundancy as we do, for example, in the molecular mechanisms and housekeeping genes.
So we argue that if two cell types are very, very similar, then they're doing more or less the same thing. So if we want to identify distinct cell types, we want to look at the things that are most dissimilar to one another.
So there's an algorithm called archetypal analysis. It's quite old. And basically, you give it very high dimensional data, and it looks for the most dissimilar points in the data.
So if you consider that if your data is three dimensional, you have a triangle, the three most extreme points would be the vertices of the triangle. Right? And then if we put a bunch of points within the three vertices of the triangle, those vertices are still your most extreme points. And we call those points the archetypes.
If you have 10,000 dimensional data, finding the corners of this 10,000 dimensional volume is very, very difficult. So what we do is we try to estimate them. And we ask, OK, start by showing me the two most extreme points, then the three most extreme, then 30, 40. And we usually stop around there.
And that's usually enough. Because once we have these points-- once we have the corners of our data set-- every other point that lies inside the boundary defined by those points can be expressed as a convex combination of those points.
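For reference, one standard way to write down archetypal analysis (following Cutler and Breiman's original formulation) is, for a data matrix X with cells as columns:

\min_{A,B} \lVert X - XBA \rVert_F^2 \quad \text{subject to} \quad B \ge 0,\; \mathbf{1}^\top B = \mathbf{1}^\top,\; A \ge 0,\; \mathbf{1}^\top A = \mathbf{1}^\top

Here the k columns of XB are the archetypes, themselves convex combinations of actual cells, and column i of A gives the convex weights that express cell i in terms of those archetypes.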
And what this lets us do is create a reduced representation, like PCA, that is actually a lot more informative. And what we do is that-- well, let's go back to the tutorial. This representation also is not only-- actually, wait. No, let's not do that. Let's go here.
We can use this representation and then also construct a network out of it, just like UMAP does with KNN. But instead of using the individual data points, we use these extreme data points. And then we can use that to assign membership of every other cell in the data set to these points.
And what this allows you to do is give you a very, very high resolution visualization as well as a very, very high resolution network that is much more faithful in representing the underlying biology. This might sound a little terse.
The reason is it's kind of complicated. There's a lot of math behind it. I'll put the link. There's two papers associated with this, if anyone's interested in reading them. Instead, what I'm going to do is I'm just going to show you what this looks like and why it's better than the other things that we were just looking-- the naive methods.
So we developed a software package that implements this algorithm. And we start off with the same thing. We create an object that contains our data. And it's not rendering. That's down here, actually. And it looks almost identical, because it's derived from a single cell experiment. It's very similar.
And we can do the same thing we did before. We're just going to filter our genes. We're working with the same data still, by the way. We're going to filter our genes. We're going to normalize them exactly as we did before. And then we're going to run our algorithm, which is composed of two parts.
The first is the reduction. And what this does is first, it's going to-- well, I'll explain it. First, it's going to do SVD, which is almost PCA. But what we're going to do is we're going to remove uninformative sources of variation.
So every single cell is going to express the same set of housekeeping genes. And those housekeeping genes-- there's not just a few of them. There's a lot of them. So there's a lot of information that isn't informative. So what we're going to do is we're going to remove that component of the data, the mean expression profile that is consistent across everything.
And what that's going to do, it's going to amplify the things that actually define cell type specificity. It's going to amplify the signal of the genes that are actually associated with cell type identity or with disease or with cell state, whatever.
And then we're going to run this algorithm. It's going to do a couple of things. It's going to take our already-reduced representation and decompose it using the archetypal analysis algorithm. It's going to create a new representation based off of, by default, the 30 most extreme data points.
And in theory, these data points represent the canonical profiles of our various cell types. But what else is extreme in data? Well, noise, outliers, junk. And this is one of those it's-not-a-bug, it's-a-feature type things.
Because the algorithm captures those things, it's very easy to just identify them and then remove them. And then whatever's left is true biologically-driven variation. So we're going to run that. We're going to create a new representation. And then we're going to use that representation to construct a network using an algorithm called k*-NN.
And this network, just like the network we use for UMAP, it's going to assign-- it's basically going to tell us how similar every cell is to every other cell. But instead of just doing the Euclidean distance, which is what most of these algorithms do-- they just take the Euclidean distance.
We can do better than that. We can use something called the Jensen-Shannon distance. So the Jensen-Shannon divergence is a measure of how similar or dissimilar two distributions are-- two probability distributions, really. But we can pretend that gene expression distributions are probability distributions. This works quite well.
And then we take the square root of this value, and that gives us something that is an actual metric-- which means that if you move the two points around in space, their relative distances stay the same. And this gives us a measure of cell-to-cell similarity for every single cell in our data set.
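A small sketch of that computation between two cells' profiles, treating each normalized profile as a probability distribution:

    # Jensen-Shannon distance: sqrt of the JS divergence, which is a true metric
    js_dist <- function(p, q) {
      p <- p / sum(p); q <- q / sum(q)
      m <- (p + q) / 2
      kl <- function(a, b) sum(ifelse(a > 0, a * log(a / b), 0))   # KL divergence
      sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))
    }
    js_dist(counts[, 1], counts[, 2])   # similarity of cells 1 and 2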
And this allows us to do some very fancy things, like I'll show you. So right off the bat, we run this algorithm. We feed this network to UMAP, instead of letting it make its own network. And we do some fancy things to make the UMAP look a little prettier. It's mostly for aesthetic purposes.
And the end result is something that looks like this. And I told you there were 16 cell types. And now this is far more convincing in showing you there are in fact 16 cell types. Right? You can actually see all of them here. And they're very well separated. And they also make a much more efficient use of space than vanilla UMAP does.
And right now, they don't have identities assigned to them. But I'll show you how we can do that. And if we compare it to the previous-- just the PCA representations-- I'm going to want to run this. Let me run this.
So this might take a second. So this is a very small data set. I don't know, if you've never worked with single cell data, you quickly find out that this is not the type of analysis you can just do on a laptop. It's actually very computationally expensive. And it's a big part of the field, finding tools that can do these types of analysis efficiently.
Because it's actually very hard and very expensive to do. So right now, you start seeing in the literature people publishing all these data sets with like a million some cells. And it requires a lot of infrastructure to run.
Whereas with things like the TRAP analysis, we can do very easily on just any desktop or laptop computer. So this is our two dimensional UMAP representation that is generated by the algorithm that we just ran, this archetypal analysis algorithm followed by the fancy network construction.
And now we compare it to the original one. Obviously, this gives us much better resolution. So we can feed normal PCA to the algorithm, or we can do what we did, which is this reduction function, which does an SVD followed by an orthogonal projection that removes all these non-informative genes.
If you plot them side by side, they look the same. But as the algorithm runs, there's this propagation of error that you want to minimize, right? So if you were to give it just the raw PCA, it wouldn't work quite as nicely.
OK, so there's a lot of detail here that you can read through, if you're interested in how the algorithm actually works, references to all the theory and all the papers. But I'll just show you what the end result looks like. We'll do the same thing again.
So this package has a lot of tools to do all the basic analysis stuff that I showed you before, like, for example, visualizing marker gene expression. You don't have to do it manually. So here, I'm just plotting the raw marker gene expression. I need to give it the genes we're looking for. Where are the genes?
Oops. I think we lost them. Sorry. Oh, there. I think that was it. OK. So this is the raw expression. And it looks similar to the vanilla UMAP. It's a little easier to read now, because the clusters are better separated.
But now it's much more obvious that there's missing data here, right? There's a lot of missing data, in fact. We can do better than that. So given this network that we've built that we use to generate these much nicer 2D projections, we can actually use them for other things.
Since we have a measure of how similar these cells are in their relationship to one another, this allows us to actually use algorithms to impute the missing data. So one of those algorithms that we use that works very, very well is PageRank.
PageRank you may have heard of before. It's the Google search engine algorithm. And we can actually use this algorithm to propagate information throughout our network. So we give it a list of genes. And we want it to basically do this diffusion-based smoothing using this algorithm.
And when we do that-- the alpha parameter controls how much smoothing we do-- we can impute the missing data. And here, these are very highly-expressed canonical markers, so it doesn't add that much more information. What it does is smooth out gene expression where you don't expect it or don't want it or it shouldn't be there, but it's there because of contamination, noise, or whatever.
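As a sketch of the idea (not the package's exact implementation), personalized PageRank over the cell-cell graph can diffuse one gene's expression across neighbors; the graph g is the one built earlier, and the gene name is just a placeholder.

    library(igraph)

    # Diffusion smoothing: restart distribution = the gene's expression profile;
    # higher damping (alpha) means more smoothing across the graph
    smooth_gene <- function(g, expr, alpha = 0.85) {
      page_rank(g, damping = alpha, personalized = expr / sum(expr))$vector
    }
    imputed <- smooth_gene(g, logcounts(sce)["Mbp", ])   # 'Mbp' as an example gene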
But where it really shines is when you have these extremely lowly expressed but still very specific marker genes that are otherwise very, very difficult to visualize. And now we can take that same graph that we built-- the new fancy one that uses the Jensen-Shannon distance and all that-- and we can put it back into the Leiden clustering algorithm with the same parameters that we used before, where we originally fed it the UMAP KNN graph.
And we can get a much better clustering. So how many clusters do we have here? We have a similar number of them, but now they map better to the clusters that we can see and the clusters that we know actually have distinct identities. And this 5, this one is an interesting one, because it's not actually inaccurate.
There's heterogeneity within the MSN clusters. There's these two populations. The more abundant ones are called the matrix SPNs. And the less abundant ones are called striosome SPNs. And these are actually extremely similar to one another, despite coming from distinct cell populations.
So the Leiden algorithm using our ACTIONet graph structure was actually able to identify this. And we can increase the resolution parameter so these two separate. And so when we do the identification of these archetypes, these extreme data points, we can assign-- we identified-- I don't know how many we identified.
So we told it to look between two and 30. So first, it looks for two, then three, then four and five. And in total, it finds 464 distinct points. It then asks, OK, which ones are redundant? It removes them. Which ones are not reproducible? They're probably not actually vertices. So it removes those, too.
Which ones are trivial? There is a single cell out there somewhere. It removes those. And then we don't actually want the 426 clusters that are left. That's still too many to deal with. We can just give it some threshold by which it says, OK, these two are very similar. So all the cells that are most similar to this data point, just pretend like they're part of the same cluster, right?
And this gives us 23 groupings of cells. Now, this is effectively just another clustering, like the Leiden. But notice it looks very different. In some cases, it recapitulates entire cell types, just like Leiden does. It matches.
In some cases, it gives you sub-cell-type resolution. There might be a cell type where you have a different transcriptional phenotype driven by some very subtle signal. It can capture that. But where it shines is it can actually capture things that are not related to biological variation.
So this is a really good way of identifying groupings of cells that are, for example, the result of some technical artifact. Maybe they're undersequenced, they're very contaminated. It also gives you things that Leiden won't tell you. Like, you might have a grouping of cells that is, I don't know, characterized by increased stress markers.
So there's many different clustering algorithms out there. And the best way to use them is by using several of them, because they all identify different things. There's algorithms like Louvain, HDBSCAN. And they all identify different things in your data.
And they're very useful for cleaning the data or identifying cell types or finding interesting pathways that are being activated in some cell state. So when we do the analysis and we do the data curation, we apply a lot of these algorithms, because they're good at identifying a lot of things.
And the better and more informative your underlying graph structure is, the better you can define the relationship between different cell types, the better these things work. OK. And then we can use the same underlying graph structure to do a more refined search for marker genes.
And what this does is it gives you similar results to just doing a normal statistical test. But the results are much more specific. Because now, you can cross validate the enrichment by asking, OK, for this cell, we have this marker that's very highly enriched. But is it also enriched in all the cells that are near me, right?
And we can pull out-- so we have all the canonical markers here. These are oligodendrocytes. We can see MBP, PLP1. We can see that these are D2 SPNs. So we have expression of Drd2. Over here are astrocytes. So we have APOE, GPC5.
Now we can pull one of these out and say, OK, not all of these are like canonical markers. Some of these are de novo markers. How accurate are they? How good is our algorithm at finding new markers? Turns out it's very, very good.
If we were just to pull out the top nine, some of these genes are genes we may never-- if you work with oligodendrocytes, you may have never heard of these genes. But they're extremely specific markers, right? And what it will do is it'll give you a ranking of every single gene in the transcriptome ranked by specificity for each individual cell type.
And that's very powerful for doing de novo marker identification. So now we're going to load here a list of canonical markers for striatal cell types. And this is not a very extensive list. It's not a very good list. It's just four genes, two genes, eight genes that we handpicked from the literature.
It's not a very rigorous list. But sometimes this is all the information we have. Maybe we're investigating a part of the brain that just isn't very well characterized or we're looking at cell types that we know only have one or two good markers from cell surface receptors.
What can we do with this in our data? Well, this marker set contains 16 populations and between five and 10 or 15 markers for each. And these are the cell types that I'm expecting to see. With our high resolution decomposition, with this underlying network that gives us cell-to-cell similarity metrics, and with our ability to impute missing data, we can also automate the process of annotating cell types with very high accuracy.
So I'm going to give this function our object containing our counts and our metadata. And I'm going to give it this list of marker genes. And then I'm going to plot the labels that it thinks each cell belongs to. And what it's going to do, it's not going to annotate the whole cluster, which would be very efficient, but also less accurate.
Because we would have to give it a perfect cluster. No, it's going to go and ask, OK, what is the enrichment of these marker genes in every single cell in the data set? And it's going to annotate every single cell individually. And this is what it gives us.
And this is 100% correct, to my knowledge. The transparency is a measure of confidence. So if it's less confident, that'll make it more transparent. And if we had more markers-- ideally, you want many more markers. You get more confidence. And of course, there's heterogeneity. I mentioned that the striosomes are different than the matrix subtype of the SPNs.
And in fact, the canonical markers are actually much more highly expressed in the matrix, which is why it says, OK, well, these are probably also D1, D2 SPNs. But they're not as enriched for the markers you gave me. And what's nice is, because we have that network that tells us how similar cells are, we can use it to correct labels if we have cells that are contaminated.
So we can run, for example, a label propagation algorithm on our data that says, OK, here are the original labels. Here are the confidences of the labels. Correct the label. So if in this cluster of oligodendrocytes, you have a point there labeled astrocyte and it says, OK, well, this point is labeled astrocyte because it's enriched for these astrocyte markers relative to the cells around it.
But the 20 cells most near to me are all oligodendrocytes. I'm probably an oligodendrocyte as well. So it also permits this error correction mechanism. So far, we've only looked at a single sample. And I think I have the time to finish this.
What happens if we have multiple samples? Especially when it comes to single cell data, this is where things get a little complicated, because batch effects are ubiquitous. Especially when you're dealing with human data, it's very, very difficult.
So I'm going to show you a very extreme example, like one of the worst-case scenarios, and how we can actually fix that. So same thing. I'm going to load this data. This is actually human Huntington's disease post-mortem tissue data. This is real data. This is published data.
I want to show you how disgusting it looks. And you'll be amazed that this is something that could actually be published. OK. So I'm going to run-- I'm going to close it out. I'm not going to spoil it. I'm going to run our algorithm on the data, just as I did before.
And we're going to see how our data looks. So here, I just subsampled-- what is this, 4,500 cells I think? From I don't even remember how many people. OK. It's running the algorithm. It's building the network. Now it's feeding the network to the UMAP algorithm.
It's giving us our 2D and 3D visualizations. I think I showed you the 3D visualizations. OK. So each one of these colorings, each label is a different person. Notice they have basically no overlap.
You'd be even more surprised to find that, across people, these are the same cell type or the same family of cell types. There are two types of SPNs here. They're very similar. And they don't overlap at all. This is an extreme batch effect, in fact. This is the worst-case scenario.
But we can actually correct for this. And there's many different ways to do that. So one approach that is very brute force-- but also very popular-- is this panoramic stitching of the different data sets together. And I took this figure from a method called Scanorama. There's another paper called batchelor.
And these basically implement an algorithm called MNN, Mutual Nearest Neighbor, which is the same algorithm that our package uses to generate the cell-to-cell similarity matrix. And what it does is it looks at each data set and then asks, OK, in the other data set, what are the cell types most similar to me?
And it does that for every data set, and then it drags them all together. So this will also take a sec to run. Actually, I don't even need to run it. I'm going to run it anyway, because some of the things downstream might rely on this.
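For reference, the batchelor version of that MNN correction is essentially a one-liner; sce_hd and its sample column are placeholder names for the combined object here, not the tutorial's variables.

    library(batchelor)

    # MNN-based batch correction across samples; the corrected
    # low-dimensional coordinates land in reducedDim(corrected, "corrected")
    corrected <- fastMNN(sce_hd, batch = sce_hd$sample, d = 50)
    head(reducedDim(corrected, "corrected"))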
So the algorithm is going to stitch the data sets together. It's actually going to reduce them first using PCA. And then it's going to stitch them together. And then we're going to feed that output into the ACTIONet algorithm. And then we're going to plot them. The first line is the one that's going to take the longest to run.
So I'll ask again if anybody has questions while all this goes. And again, as with everything else, there's details and references to all of these things in the document, if you want to pull it from the GitHub and go through it. Give it a sec. It's almost done.
Now this looks very, very different. This is the brute force method of getting rid of all the batch-effect-associated variance. Now, what's interesting about this is you don't give it a covariate, right? It doesn't know what the sources of unwanted variation to remove are.
It just guesses based on the similarity of data points across data sets. And even that is sufficient to do this kind of correction. This works better. I'll say all of these methods work better the more data you have. If you have only one cell type in one data set and only one cell type in another data set and they're not the same cell type, it's still going to ask the same question of which cell type is most similar to me.
But it's not going to be the same cell type. So it's actually just going to smear the data. So the more heterogeneous your data is-- and ideally, the data has to have something in common-- then the better all these algorithms work.
Now, we developed an algorithm that does the same thing. But in this case, it's a little faster. And you can also give it the exact variation that you want to remove. So in this case, I want to tell it, I know that the effect is driven by the different samples that I'm calling batches.
And I want to keep all the other variation that is not associated with that. So only the variation that is driven by the fact that these samples came from different sequencing runs, different people, that's the only thing I want to remove.
So we're going to run that real quick. And we're going to compare how that works against the brute force panoramic stitching method. So what this does is it runs our decomposition, where we remove all the uninformative genes. And then it basically asks, OK, what does this one sample have that is different from every single other sample? And it does that for every sample.
And then for each sample, it removes that from every other sample. So it forces all the clusters together by removing the batch-specific variation. And this is what we get. So they're both clean. They're both good. They both give correct results.
As the data gets uglier, some methods work better than others. And there's other fancy methods that use deep learning approaches to try to estimate the true underlying expression profile. There's other really brute force approaches that I personally think are quite destructive to the data. But they're super, super fast.
So there's many different types of batch correction algorithms out there. And even an effect so extreme as this can be easily corrected in the data. All right, and then we talk about-- let's go to dealing with differential expression and how we do it in single cell and how it differs from TRAP or bulk RNA seq.
So again, we did the same thing. We're going to load our-- this is R6/2. So this is from the same mice that I showed you for the TRAP example. Exact same samples, except this is single cell. And I want to treat it the same way. I'm going to remove all the low-abundance genes.
I'm going to just do library size normalization. I'm going to log scale them. And then I'm going to pull out the count matrix. And I'm just going to run a simple statistical test. I show how to implement this by hand in the first tutorial that's on the GitHub page.
This is just a function that does it really, really fast. And it's just going to do the Wilcoxon rank sum test, which is going to tell me what the differentially-expressed genes are. And then I'm going to pull out the results specifically for HD versus control. Right?
So that's going to be quite fast. And the results look like this. I'm going to pull them up top. So these are just the top genes ranked by p value. And you'll notice there's something wrong here. These p values are incredibly unrealistic. And I also have genes here that I don't like, because I know that they're just false positives.
So the problem here-- and this is what people used to do, maybe up until two years ago, in single cell papers-- is that they would treat each cell as a replicate, which lies somewhere between incorrect and fraudulent.
Because I have three mice in this data. My n is 3, not-- what is it here? I don't know, 10,000 or something. So I'm not getting accurate results. And on top of that, because I have these genes that are super highly variable, these are going to dominate the signal.
Also, if I have a mouse with 5,000 cells and then two mice with like 100 cells, one mouse is going to dominate everything. And it's still going to give me these incredibly small p values, because it thinks I have this massive sample size, when I really don't.
So what people have done now-- the standard is to pseudobulk your data. So now that I have all the cell types annotated, I'm going to separate the data out. And then for each sample and for each individual cell type, I'm just going to take the sum or the average across all my cells.
And I'm going to have one vector representing the gene expression for that one cell type out of one sample. And the end result is that you now have data that look identical to the TRAP data. So if I had a TRAP astrocyte experiment and I were to pull out the astrocytes from my single cell data and pseudobulk them, the two data sets would be nearly identical.
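A sketch of that pseudobulking step, assuming per-cell sample and cell_type annotations in the object (placeholder column names):

    # Sum counts over all cells within each (sample, cell type) pair
    groups <- paste(sce$sample, sce$cell_type, sep = "|")
    pb <- t(rowsum(t(as.matrix(counts(sce))), group = groups))
    dim(pb)   # genes x pseudobulk samples, one column per sample-by-cell-type pair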
The advantage here is that I can now do this for every cell type, right? I can do a lower fidelity pseudo-TRAP for a bunch of different cell types at the same time. And we're going to do exactly the same thing. We're going to generate our-- probably we can run this. I just showed it to you.
Just make our metadata that we're going to feed into our differential expression package, whichever one we're going to use. In this case, again, we care only about the condition, which is the genotype. I should use the same name, for consistency.
And now we could do something like run SVA again. But we don't really need to do that. Even though we're now treating the data like it's bulk-- I have 3 minutes, I'm just going to run through this. Even though we're treating the data like it's bulk, we still have that single cell level information.
So we can actually do something really fancy. We can compute the variance across each sample and for each cell type-- the single cell level variance. And now I can use that variance as the weight for the model fitting for differential expression analysis.
And I'm specifically going to only use the weight of the genes that are stable. So this is to minimize the amount of false positives I'm going to get. So I don't want things that are super highly expressed, because there's some noncoding nuclear nonsense that I know is just super variable across nuclei. Right?
But I'm going to use this information. And I'm going to feed this as weights to a model. I'm using the package limma, which is just going to fit everything to a linear model. And it's going to use that single cell level variance to anchor the model.
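Schematically, that weighted fit in limma looks like the following; pb_log, w, meta, and the condition column are placeholder names for the log-scaled pseudobulk matrix, the stability weights derived from the single-cell variances, and the per-sample metadata.

    library(limma)

    # Linear model per gene, anchored by single-cell-derived weights
    design <- model.matrix(~ condition, data = meta)   # HD vs. control
    fit <- lmFit(pb_log, design, weights = w)
    fit <- eBayes(fit)
    topTable(fit, coef = 2, number = 10)   # top differentially expressed genes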
And I'm going to get results that are much more reasonable, right? So these p values are more in line with what I would expect with n equals 3. And yeah, we can just look at the whole thing. And at the top are things that I know are associated with the Huntington's disease phenotype, in this case.
And now that I have this, I can do exactly the same types of further downstream analysis as with the TRAP data. I can do the pathway enrichments. It's now in a nice format that I can feed into a package like WGCNA and all sorts of network analysis stuff. And I can treat the data identically across both assays.
I can even compare them directly across both assays, which we often do. So I think we're out of time. So that's all I have. Does anybody have any questions? If anybody has questions after this as well-- somebody has a question I saw. Before I answer the question, I will say, feel free to email me or reach out if you have any questions about these tutorials or about single cell data analysis in general or about our packages.
PRESENTER 1: Do you have time to answer a question right now?
SEBASTIAN PINEDA: Yeah, I'm not in a hurry. Other people might be.
AUDIENCE: Hi, great talk, thanks. Very informative. I have two questions about RNA seq. So our lab has done single nucleus RNA seq between two neuron types. And basically we're just trying to find which genes are most differentially expressed.
And so we have all these log fold changes. And something my PI wants to do is to unlog them, because it's much stronger to show a change of 8 versus 64 than, like, 3 and 6, which is the log. So can you do that, or are there problems from, like, the log fold shrinkage in the preprocessing that would bar you from doing that?
SEBASTIAN PINEDA: The log fold shrinkage is mainly because differences between two genes might be extreme. And it might prevent you from detecting a change. But if you're just looking at one gene, a log fold change of-- it depends on what log scale you're using. But a log fold change of 1, that's twice as much.
So that's significant enough. Yeah, you can choose to represent whatever scale you want. I don't think it's a problem. If you think it conveys the message better, I think that's fine. It's mostly just a thing of convenience so that when we're looking transcriptome wide, the values are easier to interpret when they're more or less on the same scale.
AUDIENCE: OK. And then the second question is, so in the RNA sequencing machine basically, you take the RNA from a nucleus and you convert it into cDNA. And during that step, it's amplified, right? So one RNA is turned into many cDNA. So should I be worried that multiple read counts are actually coming from one RNA that's been amplified a bunch of times or is that taken care of?
SEBASTIAN PINEDA: Good question. When we do this type of single cell assay, that barcoding scheme I was talking about earlier-- the UMI is actually intended to prevent exactly this. So when you have-- yeah, you have to amplify the library to maximize your detection. And when you do that, you're going to have many different transcripts that came from the same-- or you can have many amplicons that came from the same original transcript. But they're all going to have the same unique molecular identifier barcode. So what the aligner does is it tosses every duplicate, so that each unique UMI is only aligned and counted once.
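A toy sketch of what that deduplication amounts to: reads sharing a cell barcode, gene, and UMI collapse to one molecule (the barcodes and gene name here are made up).

    # Three reads, two of which came from the same original transcript
    reads <- data.frame(cell = c("AAAC", "AAAC", "AAAC"),
                        gene = c("Mbp",  "Mbp",  "Mbp"),
                        umi  = c("TTGA", "TTGA", "GGCA"))

    # Count unique UMIs per cell/gene: 2 molecules, not 3 reads
    aggregate(umi ~ cell + gene, data = reads,
              FUN = function(u) length(unique(u)))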
AUDIENCE: OK. So if there's like 2,500 counts for a gene, each of those comes from a unique barcode, which corresponds to one RNA.
SEBASTIAN PINEDA: In theory, yes. If the barcode was not sequenced or didn't change somehow, then yes, they should all be unique.
AUDIENCE: All right. Thank you.
PRESENTER 1: All right, Sebastian. This was great. Thank you so much for presenting for us today.