Cell Population Effects in a Mouse Tauopathy Model Identified by Single Cell Sequencing

Neurodegenerative disorders are complex multifactorial diseases that have poorly understood selective vulnerabilities among discrete cell populations. We performed single cell RNA sequencing of whole hippocampi from the rTg4510 mouse tauopathy model, which expresses a P301L MAPT mutation at two time points—before and after the onset of pathology. One population of neurons showed a robust size reduction in both the young and the old transgenic animals. Differential expression of genes expressed in this group of neurons suggested an enrichment in granule cell neurons. We identified genes that characterize this population of neurons using Pareto optimization of the specificity and precision of gene pairs for the population of interest. The resulting optimal marker genes were overwhelmingly associated with neuronal projections and their expression was enriched in the dentate gyrus suggesting that the rTg4510 mouse is a good model for Pick’s disease. This observation suggested that the tau mutation affects the population of neurons associated with neuronal projections even before overt tau inclusions can be observed. Out of the optimal pairs of genes identified as markers of the population of neurons of interest, we selected Purkinje cell protein 4 (Pcp4+) and Syntaxin binding protein 6 (Stxbp6+) for experimental validation. Single-molecule RNA fluorescence in situ hybridization confirmed preferential expression of these markers and localized them to the dentate gyrus.

Tauopathies are a class of neurodegenerative disorders that include, amongst others, 2 Alzheimer's disease, Pick's disease, frontotemporal dementia, and progressive 3 supranuclear palsy. These are all characterized by the misfolding and accumulation of tau 4 protein. Key molecular and cellular mechanisms of the disease, such as selective cell 5 vulnerability, remain poorly understood. Single-cell RNA sequencing (scRNA-seq) offers 6 a means to determine cellular heterogeneity by profiling thousands of individual cells 7 (Lake et al. 2016;Habib et al. 2017;Zhong et al. 2018;Lake et al. 2018;Macosko et al. 8 2015) in order to detect changes in transcriptionally similar cell populations. 9 Comprehensive characterizations of the cell types present in various mammalian organs 10 such as the dorsal root ganglia (Nguyen et al. 2017;Usoskin et al. 2014), the brain (Zeisel 11 et al. 2015;Tasic et al. 2016), the kidney (Park et al. 2018), and intestinal organoids (Grün 12 et al. 2015) have been published. In the context of Alzheimer's disease, single cell 13 sequencing of immune cells has revealed the existence of a "disease-associated 14 microglia" state in Alzheimer's disease (Keren-shaul et al. 2017). In a larger study of 15 Alzheimer's brain tissue, an over-representation of female cells was present in AD 16 subpopulations and myelination-related processes were perturbed in multiple cell types 17 (Mathys et al. 2019). 18 19 We sought to assess cell population changes in the mouse tauopathy model, rTg4510, by 20 profiling thousands of individual cells at two different time points. rTg4510 animals express 21 a mutant form of human Tau (P301L) and exhibit behavioral deficits, neuronal loss, 22 microgliosis and neurofibrillary tangles (Ramsden et al. 2005). Unbiased clustering of 23 single neuron transcriptional profiles identified a sub-population of neurons that was 24 depleted in the rTg4510 animals when compared to their littermate controls at both a 25 young age, i.e. before detection of Tau pathology, and at an older age when the tauopathy 26 was manifest. This cluster of neurons could not be assigned a neuronal identity based on 27 marker genes or gene ontology. To gain insight into the identity of this cluster, we identified 28 genes that were uniquely expressed together in the neurons of the cluster of interest, but 29 not co-expressed in the neurons of the other clusters. Optimal gene pairs, according to a 30 Pareto optimization based on sensitivity and precision criteria were then identified. We 31 validated one of the gene pairs by single molecule fluorescence in situ hybridization. The 32 set of optimal marker genes identified in this manner was associated with neuronal 33 projections. 34

35
To study the effect of Tau P301L overexpression on cellular identity we performed single-36 cell RNA sequencing using the Dropseq method (Macosko et al. 2015) on cells isolated 37 from the hippocampi of three rTg4510 and four control animals at a young age (4-6 weeks 38 old), and four rTg4510 and 5 control animals at an older age (32-40 weeks old  3624 from the young rTg4510 animals, 4776 from the old control animals and 5523 from  1 the old rTg4510 animals. 2

A population of neurons is depleted in the transgenic animals 3
To identify neuronal populations that differ between the rTg4510 and control animals, we 4 first classified cells with a set of published markers for cell types of the nervous system 5 (see Methods). Of the 19,539 cells remaining after quality control, 7600 were identified as 6 neurons, 3968 as oligodendrocytes, 3882 as astrocytes, 3437 as microglia and 652 cells 7 were of a non-specific glial identity (Fig. 1). Of the 7600 neurons, 2596 came from the 8 young control animal, 1406 from the young transgenic, 1826 from the old control and 1772 9 from the old transgenic. We focused on the neuronal population and removed from the 10 analysis those genes that were not expressed in at least ten neurons. To identify neuronal 11 types that differed between the rTg4510 animals and the littermate controls, we performed 12 dimensionality reduction and clustering using the shared nearest neighbor technique 13 implemented in the Seurat package version 2 (Butler et al. 2018). In doing so, 26 neuronal 14 clusters were identified (Fig 2A). Of these, 7 clusters were major clusters, together 15 comprising over 50% of the all neurons (Fig 2A, clusters 0 to 6 and Supplementary Table  16 I). We evaluated the robustness of these seven major clusters on simulated data 17 constructed by adding normally distributed noise to the original count matrix. We then 18 renormalized and rescaled the data and re-clustered the cells. We computed the Jaccard 19 index for each of the new clusters (see Methods). Doing so, four of the abundant clusters 20 were robustly identified in the simulations as evidenced by a mean Jaccard index greater 21 than 0.5 ( Fig 2B, clusters 0, 1, 2, and 4). 22 23 We focused our attention on cluster 1 that underwent a highly robust change between the 24 control animals and both the young and old rTg4510 animals. In both the young and old 25 control animals the proportion of neurons in cluster 1 was lower in the rTg4510 animals 26 than in the control animals in each of the animals we analyzed ( Fig. 2C right panel). We 27 confirmed that the cell population differences were robust to noise by generating 28 simulated datasets with the addition of normally distributed noise to the single cell 29 sequencing counts matrix. In every simulation generated, the proportion of neurons in 30 cluster 1 was lower in the rTg4510 samples than in the control samples (Fig. 2D). We 31 further confirmed that the cell proportion differences were not an artifact of the clustering 32 method by applying a second clustering method from Monocle2, density peaks clustering 33 (Qiu et al. 2017), which resulted in a similar decrease in a cluster of neurons in the 34 rTg4510 animals ( Fig. S1 and supplementary information). 35 36 Using the same method to analyze the microglia, we found a sub-cluster of cells that was 37 almost exclusively present in the old rTg4510 animals. This cluster expressed genes 38 related to microglial activation such as Aif1, Cd68, Il1b, Ptprc (aka Cd45) and Tnf. Thus, 39 our method confirmed the higher abundance of activated microglia in the old rTg4510 40 animals as previously reported (Lee et al. 2010). 41 42 The reduced cluster 1 abundance in the rTg4510 animals was observed even before tau 43 inclusions were detectable. Therefore, the gene expression profiles of that cluster could 44 potentially be informative of the earliest molecular processes underlying tau pathology. 45 We thus sought to identify the genes that define cluster 1. We attempted to assign a 46 4 specific neuronal cell identity to this cluster by inspecting the expression of known marker 1 genes for known neuronal cell types. This approach failed to identify a specific recognized 2 neuronal cell type indicative of the cluster. Therefore, we sought to characterize this 3 cluster by identifying sets of genes that, when taken together, precisely and specifically 4 describe the neurons of cluster 1. Differential gene expression analysis between the 5 neurons of cluster 1 and the neurons of the other clusters can provide information as to 6 the identity of the neurons in cluster 1. This approach however is inaccurate when the 7 number of cells being compared is imbalanced; nor does it consider the potential synergy 8 of genes as markers of a group of cells i-e that two genes with modest specificity and 9 precision for a cellular population on their own are highly specific and precise for a 10 population of cells when combined. Using these two parametersspecificity and 11 precisionwe devised a method to identify optimal sets of genes that described the 12 neurons of cluster 1. Briefly, for each gene, we compute two measures that together, 13 define how useful the marker gene is: the precision and sensitivity. A perfect marker would 14 identify all of the neurons of cluster 1 (sensitivity of 1) and only the neurons of cluster 1 15 (precision of 1) (see Methods). We plotted each gene as precision vs sensitivity and use 16 a Pareto optimization to find sets of optimal markers, i.e. those genes for which no other 17 gene has both a greater precision and sensitivity. Doing so, we identified nine optimal 18 single markers (Fig. S2). To identify optimal pairs of markers, we repeated the procedure 19 using a greedy approach, testing only the pairs that included one of the optimal single 20 markers. This resulted in 19 unique pairs of optimal markers (Fig. 3A). As expected, pairs 21 of genes used as markers had increased precision and sensitivity over single genes. The 22 pairs of optimal markers making up the pareto front contained 18 unique genes. These 23 genes were overwhelmingly associated with neuronal projections; 10 of these 18 genes 24 are known to be implicated in neuron projection. These include Gap43, Calm1, Ppp3ca, 25 Gria2, Ncdn, Pcp4, Nrgn, Olfm1, Plk5 and Penk. Of the other 8 genes, 4 were associated 26 with other neuronal related functions or localization: Scg2 (neuron components), Stxbp6 27 (SNARE complex), C1ql3 (synapse organization) Trpc6 (neuron differentiation). Finally, 4 28 genes had either no known function (Snhg11) or had main functions not specifically 29 related to neuronal function: Ppp1r1a and Rasl10a (signal transduction), Rbp4 (glucose 30 and retinoid process) 31 32 To validate the computational method and better characterize the expression of these 33 optimal markers in neurons of the brain, we selected a marker pair for experimental 34 validation. Given that each of the pairs are optimal, we devised another metric for selecting 35 the marker pair for validation. We computed, for each pair, the proportion of double-36 positive cells in our single cell transcriptome data for each sample type. We also computed 37 the ratio of this proportion between the old rTg4510 and old control samples and between 38 the young rTg4510 and young control samples in our single cell data (Table I). We 39 selected the marker pair with the highest fold change between the number of double-40 positive cells in the old rTg4510 and the old control samples, but limited our selection to 41 the marker pairs for which at least 5% of the cells were double positive in each of the 42 samples (old rTg4510 and old control) (Table I). Based on these criteria, we selected the 43 marker pair Stxbp6 (Syntaxin binding protein 6, also known as amysin) and Pcp4 (Purkinje 44 cell protein 4, also known as Pep19) for further characterization. In our single cell data, 45 14% of the old control cells and 5% of the old rTg4510 are double positive for Stxbp6 and 46 Pcp4. These proportions of cells were also observed in the young animals; 16% of the 47 young control cells and 5% of the young rTg4510 cells are double positive for Stxbp6 and 1 Pcp4. In general, the proportion of double positive cells for each marker was similar when 2 comparing the same genotype across time (young vs old rTg4510 or control) but differed 3 when comparing the genotypes at the same time point (old control vs rTg4510) (Table I).  4 Importantly, while the number of double-positive cells differed between the old rTg4510 5 mice and their littermate controls, the expression level of either gene was not significantly 6 different between the two genotypes at either time point (Fig. 3A). Therefore, this metric 7 identified exclusively cell population differences without an effect on the transcript level in 8 the experimental and control cell populations. 9 10 We next looked for brain regions which expressed both Stxbp6 and Pcp4 by interrogating 11 the Allen Brain Institute mouse atlas (Lein et al. 2007). Both genes are enriched in the 12 cells of the dentate gyrus (Fig. 3B). This enrichment in the dentate gyrus was also 13 observed for at least one of the genes in every gene pair identified as an optimal marker 14 pair by Pareto optimization, strongly suggesting that the cluster of neurons of interest was 15 physically located in this region. We compared these (314) genes to the Mammalian Adult 16 Neurogenesis Gene Ontology (Mango) (Overall, Paszkowski-Rogacz, and Kempermann 17 2012) and computed the enrichment of each of the terms in the ontology. This analysis 18 identified the ontology term "Granule cell neuron" consistent with enrichment of this 19 population in the dentate gyrus. 20 21 To validate the existence of the population of neurons found by single cell sequencing, 22 we performed single molecule RNA fluorescence in situ hybridization (smFISH) of both 23 control and rTg4510 mouse cryosections. We concentrated our analysis on the 24 hippocampus, specifically the dentate granule cells which were enriched with the two 25 marker genes. We quantified the proportion of double positive (Stxbp6 + /Pcp4 + ) cells in 26 each of the samples (Fig. 3C). As predicted by the computational approach based on the 27 single cell transcriptomes, a greater proportion of double positive cells were found in the 28 control sections than in the rTg4510 sections (Fig. 3D). 29 30 Given that the rTg4510 mouse model requires the breeding of two strains each carrying 31 a single transgene, we wanted to ensure that our results were only observed in the 32 presence of both transgenes. The P301L MAPT mutation is under a tetO promoter that is 33 activated in the brain when the animals are bred to CamK2a-tTa animals. We performed 34 single cell RNA sequencing of old animals carrying either of the two alleles. Using single-cell sequencing, we found a population of neurons that is less abundant in 3 the rTg4510 animals than in their littermate controls and developed a computational 4 method to identify markers of this population. This method identifies combinations of 5 genes that together are optimally precise and sensitive towards the population of cells 6 under study. This method can be applied to the characterization of any cluster of cells 7 identified by single cell sequencing. Applying our method to the characterization of cluster 8 1, we identified Pcp4 and Stxbp6 as one pair of markers that define this cluster. Pcp4 9 expression has been shown to be decreased in the brain of patients with Alzheimer's 10 disease and Huntington's disease (Utal et al. 1998) and in the brain of alcoholic patients 11 (Iwamoto et al. 2004). Out of the optimal markers for the cells of cluster 1 identified by the 12 pareto front, Gap43 ( we also observed a parallel increased abundance of another neuronal population 19 designated cluster 0 in the rTg4510 animals. Comparing these two populations of neurons 20 suggested that the neurons of cluster 1, those that are less abundant in the rTg4510 21 animals, are more mature than those of cluster 0. Together, these results associate the 22 behavioral deficits observed in the early stages of rTg4510 progression with a neuronal 23 maturation deficit or the emergence of a precursor population. We confirmed the finding 24 of Pcp4 and Stxbp6 as marker genes for the reduced population by single molecule RNA 25 FISH. This localization suggested a resemblance of rTg4510 mouse model to Pick's 26 disease, a tauopathy with selective vulnerability of dentate granule cells. 27 28 The two marker genes for the cluster that decreased with disease progression are 29 exclusively co-detected in the dentate granule cells. The dentate gyrus is a major area in 30 which neurogenesis takes place in the adult mammalian brain (Eriksson et al. 1998) even 31 at older ages (Kempermann, Kuhn, and Gage 1997) and is thought to play a role in 32 hippocampal-dependent learning and memory (Gould et al. 1999;Shors et al. 2001;33 Snyder et al. 2005) The effects on neurogenesis (reviewed in (Rodríguez and Verkhratsky 34 2011)) and cell cycle re-entry (reviewed in (Yang and Herrup 2007)) in Alzheimer animal 35 models are complex. Most of these findings pertain to models of beta-amyloid deposition. 36 Much less is known about neurogenesis in the tauopathies. Studies of this topic must 37 address the balance between proliferation and differentiation. This balance appears to 38 shift toward proliferative cells in the Alzheimer brain (Stopa et al. 1990;Hock et al. 2000). 39 Compared to controls, Alzheimer patients' brains increase the expression of doublecortin, 40 polysialylated nerve cell adhesion molecule, neurogenic differentiation factor and TUC-4 41 (Jin et al. 2004). Doublecortin and TUC-4 are associated with neurons in the neuro-42 proliferative zone of the dentate gyrus suggesting that neurogenesis is increased in the 43 Alzheimer hippocampus, where it may give rise to cells that replace neurons lost in the 44 disease (Jin et al. 2004). The localization of tau pathology to these regions suggests that 45 the tauopathy may be responsible for effects of neurogenesis, which would result in the 46 more immature profile we observed here. Furthermore, the entire cluster of genes in the 1 Pareto front directed toward axonal projections is consistent with another P301L tau over-2 expressing transgenic mouse in which tau occupies the terminal zone of performant 3 pathway (Pooler et al. 2013). We do note the caveat in a recent report the observed 4 phenotype in rTg4510 animals could be the result of the disruption of six endogenous 5 genes caused by the insertion of the Tau and tTa allele (Gamache et al. 2019). However, 6 none of these genes were identified as markers of the population of neurons we identified. 7 Tables   8   Table I. Predicted proportion and fold change of the optimal marker pairs identified. 9 Stxbp6+Pcp4 (in bold) were selected for experimental validation because of the combined 10 relatively high expected proportion in both strains at both timepoints and the expected 11 high proportion fold change at both time points. 12 13 Female rTg4510 that express Tau P301L and rTTA were euthanized with CO2 at young (4-4 6 weeks old) and at old age (32-39 weeks old). The hippocampus was dissected out, 5 digested with papain and mechanically triturated. Cell suspension was layered on 4% BSA 6 and spun at 300g for 10 minutes to remove debris. The pellet was resuspended in PBS 7 and processed for dropseq as described (Macosko et al. 2015). Mice from both strains 8 were processed together from dissection to droplet generation and cDNA generation. 9 Once two animals from each time points were processed to cDNA generation, pre-10 amplification, tagmentation, final library preparation and sequencing were done with all 11 samples in parallel. This whole process was repeated for the other animals. Scientific Instrumentation, Inc. Eugene, OR). Sections that spanned the hippocampal 10 formation from anterior to posterior used for quantitative analysis. Mosaics were captured 11 using an UPlanSApo 20x oil immersion lens, N.A. 0.85 at 1 µm intervals along the z-axis 12 and a pixel array of 800 x 800 in the x-y axes. 13 14 For the single molecule fluorescence in situ hybridization, specimens were imaged on an 15 Olympus IX71 with a 60X water immersion objective. Twelve to fifteen Z-stacks of 0.3um 16 were taken. Images were maximum projected and the smFISH signal deconvolved using 17 a LoG filter. Cells were segmented on the DAPI signal using a watershed transformation 18 and removing small cells using the EBImage package (Huber et al. 2010). smFISH signal 19 per cell was counted as described (Lyubimova et al. 2013). Three slides per genotype 20 were imaged each with five to seven field of views. The average field of view contained 21 91 cells for a total of 1599 cells analyzed for the WT samples and 1413 for the rTg4510 22 samples. The proportion of double positive cells over all the segmented cells per field of 23 view is reported. 24

25
Single cell RNA sequencing data processing 26 BCL files were transformed to fastq files using Illumina's bcl2fastq setting the short 27 adapter reads mask to 18. Fastq files were then transformed to BAM files using picard 28 tools. BAM files were processed as described (Macosko et al. 2015) and in the dropseq 29 cookbook v1.2 to assign reads to cells based on UMI, remove low quality reads, trim the 30 adapters, align the reads and extract the digital expression matrix. This yielded 20900 31 cells. From this digital expression matrix, we removed 38 cells whose total number of 32 counts was less than 100. We also removed 1323 cells for which the total number of 33 transcripts detected was outside 2 standard deviations of the mean. 34

Classification of cell types 35
To classify cells as either neurons, astrocytes, oligodendrocytes or non specific glia, we 36 used a set of published cell type markers (Ko et al. 2013). For each cell, we computed a 37 normalized score to each cell type as described (Macosko et al. 2015). We defined four 38 patterns of gene expression: only the genes of the neuron type are expressed, only those 39 of the astrocyte type, only those of the oligodendrocytes or both those of the astrocytes 40 and oligodendrocytes, representing uncharacterized glia. We assigned cell type based on 41 the maximum correlation to these four patterns. 42 The counts for the cells identified as neurons were normalized and scaled using the Seurat 1 package, regressing out the sequencing run and using a negative binomial distribution to 2 scale the data. Using this transformed dataset, PCA was performed and tSNE projection 3 and clustering were performed using the first 30 dimensions of the PCA. 4 5 Simulations 6 We generated 50 new datasets by adding normally distributed noise (mean = 0, sd = 2) 7 to each count and rounding to zero the resulting negative counts. Using the new count 8 matrix, we repeated the normalization, scaling, dimensionality reduction and clustering of 9 the cells. We computed, for each initial cluster, its Jaccard coefficient in the new dataset. 10

Marker identification 11
To identify sets of genes that define the cluster of interest, we computed, for each gene, ratio of how many of the neurons of cluster 1 were found by this gene over all the neurons 21 of cluster 1 was the sensitivity of the gene. We treated negative markers similarly, 22 selecting negative cells as the ones whose expression is below the 95 th percentile. Of all 23 the genes scored, we select the optimal ones using a Pareto optimization. To identify pairs 24 of optimal markers, we computed, for each pair containing at least one optimal marker (as 25 identified above), its purity and proportion by sequentially "isolating" the neurons, first on 26 the optimal marker, then on the secondary marker.  tSNE representation of the neurons after dimensionality reduction and cluster identification. Clusters labeled 0 to 6 comprise over 50% of all the neurons. B.
Jaccard coefficient of the simulations for the 7 abundant clusters. Clusters 0,1,2, and 4 are robustly identified in the simulations with a mean jaccard coefficient greater than 0.5. C.
tSNE representation of the neurons as in A but split according to genotype and age and highlighting only cluster 1. The proportion of neurons in cluster 1 is shown on the right. The rTg4510 animals have a smaller proportion of neurons in cluster 1 than the control animals. [The mean proportion per animal is presented, the error bar represents the SEM; *t.test pvalue < 0.05; **t.test pvalue < 0.01]. D.
Proportion of neurons in cluster 1 for each sample type in the simulated dataset. Each gray line represents 1 simulation. For every simulation, the proportion of neurons in cluster 1 is smaller in the rTg4510 samples.  In situ hybridization data from the Allen Brain Institute mouse atlas for adult (P19) animals showing that both Pcp4 and Stxbp6 are expressed in the dentate gyrus. C.
Representative images of the single molecule fluorescence in situ hybridization. Inset are scaled at the right of the image. Round arrowheads highlight Stxb6 signal whereas triangular arrowheads highlight Pcp4 signal. Scale bar: 10um. D.
Quantification of the smFISH images. The proportion of cells positive for both Stxbp6 and Pcp4 is higher in the control samples than in the rTg4510.