Correcting differential gene expression analysis for cyto-architectural alterations in substantia nigra of Parkinson’s disease patients reveals known and potential novel disease-associated genes and pathways

Several studies have analyzed gene expression profiles in the substantia nigra to better understand the pathological mechanisms causing Parkinson’s disease (PD). However, the concordance between the identified gene signatures in these individual studies was generally low. This might be caused by a change in cell type composition as loss of dopaminergic neurons in the substantia nigra pars compacta is a hallmark of PD. Through an extensive meta-analysis of nine previously published microarray studies, we demonstrated that a big proportion of the detected differentially expressed genes was indeed caused by cyto-architectural alterations due to the heterogeneity in the neurodegenerative stage and/or technical artifacts. After correcting for cell composition, we identified a common signature that deregulated the previously unreported ammonium transport, as well as known biological processes including bioenergetic pathways, response to proteotoxic stress, and immune response. By integrating with protein-interaction data, we shortlisted a set of key genes, such as LRRK2, PINK1, and PRKN known to be related to PD; others with compelling evidence for their role in neurodegeneration, such as GSK3β, WWOX, and VPC; as well as novel potential players in the PD pathogenesis, including NTRK1, TRIM25, ELAVL1. Together, these data showed the importance of accounting for cyto-architecture in these analyses and highlight the contribution of multiple cell types and novel processes to PD pathology providing potential new targets for drug development. Significance Statement The exploration of the transcriptomic landscape in PD is pivotal for the understanding of the pathological mechanisms of this disease. Nonetheless, little attention has been paid to the influence of cell composition on the transcriptome even though it is known that cyto-architecture undergoes major alterations in neurodegenerative diseases such as PD. Our study signifies that changes in cellular architecture of human substantia nigra in PD have a strong effect on the set of detected differentially expressed genes. By reanalyzing the data and accounting for cell composition, we provide an updated description of deregulated biological processes in PD and nominate a shortlist of PD-associated genes for further investigations.


Introduction
Parkinson's disease (PD) is the second most common neurodegenerative disorder after Alzheimer's disease. In PD, the loss of dopaminergic neurons in the substantia nigra pars compacta and neurodegeneration in other brain areas leads to motor as well as non-motor manifestations (Surmeier et al., 2017;Poewe et al., 2017;Balestrino and Schapira, 2020).
Alpha-synuclein positive inclusions, termed Lewy bodies (LB) and Lewy neurites, are found in the surviving neurons (Fares et al., 2021). Despite the elusiviness of the biogenesis and spreading of these structures, according to Braak's model (Braak et al., 2003), LB-pathology spreads in the PD brain along a caudo-rostral trajectory, correlating with disease progression.
Notwithstanding great progress since its initial description (Parkinson, 1817), the causative factors remain poorly understood. Various environmental, lifestyle, and genetic factors, including rare and highly penetrant variants in a limited number of genes (Bandres-Ciga et al., 2020) and 90 common risk loci (Nalls et al., 2019), have been implicated in its pathogenesis. Still, a big proportion of missing heritability remains.
In parallel to genetic studies, genome-wide expression profiling has been used to characterize alterations in molecular pathways in different brain regions, blood, and other tissues of PD patients (Borrageiro et al., 2018). For example, a recent transcriptomics study reported evidence of differential brain regional vulnerability to PD progression in accordance with the Braak's hypothesis (Keo et al., 2020). Based on its role in PD, the substantia nigra, has been extensively investigated in these genome-wide expression profiling studies.
Further insight into PD pathogenic pathways has been enabled by the aggregation of small-scale, low-powered, and low concordance studies (Oerton et al., 2017) into larger meta-analyses, which has led to the nomination of putative key regulators of disease progression (Zheng et al., 2010;Wang et al., 2019a). This approach has proven to be fruitful, even in the context of a high degree of heterogeneity in the putative causes and severity of phenotypes of the included patients.
Unfortunately, transcriptomic studies suffer from various technical limitations, such as RNA degradation, that affect the different cell types to a variable extent (Jaffe et al., 2017). Furthermore, differences in cyto-architecture originating from biological heterogeneity (due to e.g. age, gender, genetic background), as well as sample preparation can further influence downstream analyses. An even stronger confounder might be represented by cell composition changes induced by a pathology (Srinivasan et al., 2016). This issue is particularly concerning since it is not possible to distinguish between changes in genes that are highly expressed in a cell type whose proportion change and a genuine pathology-related transcriptional deregulation. While the former might be interesting as it may correlate with disease progression, the latter can inform on the molecular mechanism for which therapeutic strategies might be devised. In neurodegenerative disorders and especially in highly affected brain regions, as the substantia nigra in PD, this phenomenon might be even more pronounced.
To tackle these and other challenges, single cell RNA sequencing (scRNAseq) and single nucleus RNA sequencing (snRNAseq) are currently rising in popularity. However, thus far only a limited number of PD substantia nigra samples have been profiled at the single cell level (Smajić et al., 2021) severely decreasing the power to detect relevant differences.
Alternatively, cell proportions can be estimated from bulk transcriptomics data, and then analyses can be corrected for altered cyto-architecture. Recently, several bioinformatic approaches have been proposed to estimate and use these surrogates for the proportions of the cell types, offering the opportunity to exploit the enormous amount of data readily available in public repositories (Shen-Orr et al., 2010;Mancarci et al., 2017;McKenzie et al., 2018;Newman et al., 2019;Wang et al., 2019b;Zaitsev et al., 2019;Dong et al., 2020;Li et al., 2020). 6 In this study, we systematically assessed the transcriptomic evidence for the presence of changes in cell compositions in expression data of nine publicly available Parkinson's related microarrays of the substantia nigra (see Figure 1 for an overview). We conducted the first meta-analysis of these data sets while evaluating the impact of the cyto-architectural alterations on the differential expression analysis. By correcting for these effects, we were able to detect genuine disease-related changes in the transcriptional landscape of the substantia nigra in PD patients. Finally, we explored the protein interactome of the identified deregulated genes and nominated promising candidates for further investigations by exploiting their network characteristics.

Transcriptome data sets download, pre-processing, and gender imputation
We downloaded gene expression data sets from the Gene Expression Omnibus (GEO) using "Parkinson's disease" and "substantia nigra" as keywords. In total we found 10 studies: nine (GSE7621, GSE20333, GSE20292, GSE20163, GSE20164, GSE54282, GSE49036, GSE43490, and GSE42966) analyzed the whole substantia nigra for each patient; one analyzed separately the medial and the lateral part of the substantia nigra (GSE8397) from which we only used the lateral as more affected in PD (Duke et al., 2007). GSE54282 was excluded from analysis because of the small number of available samples. Only probes mapping to an Entrez ID using the biomaRt package version 2.42.1 (Durinck et al., 2005) were kept. When multiple probes mapped to the same Entrez ID, we kept the one with the maximum variance and connectivity using the collapseRows function from the WGCNA package version 1.69 (Miller et al., 2011), resulting in a total of 18,948 genes. We used gene expression as already processed in the original studies (either Robust Multichip Average normalization or the Affymetrix microarray suite), removed outlier samples according to the original publications. Finally, expressions were log transformed, the merged data set was quantile normalized with the normalizeBetweenArrays function from the limma package version 3.42.2 (Ritchie et al., 2015). Since gender information was not always available, we used the massiR package version 1. 22.0 (Buckberry et al., 2014) to annotate all the samples using the top 75% variable genes for the prediction.

DEGs meta-analysis
The DEGs were identified by fitting a linear mixed-effects model (LMM). For each gene, two LMMs were fitted using the lmer function from the lmerTest package version 3.1.2. One LMM accounts for the status and the gender as fixed effects and the different studies as random-effects: Geneexp~Status + Gender + (1|Study), with (1|Study) indicating a one-hot encoding of the study. The other LMM, accounts also for estimates of cell types (Neurons (NEU), Oligodendrocytes (ODC) and oligodendrocytes precursor cells (OPC)) as fixed effects: Geneexp~Status + NEU + ODC + OPC + Gender+(1|Study). We chose to only use a subset of the cell types as correction covariates to avoid the introduction of collinear predictors.
Specifically, we included those cell types whose estimates are only weakly correlated with the estimates of the other cell types in the model but highly correlated with the excluded ones (Extended Data Figure 1-1B). Moreover, we ensured to include cell types whose estimates show changes in opposite directions between PD and CTRL ( Figure 1B). P-values were corrected using the Benjamini-Hochberg (BH) method.

Cell type markers identification from Substantia nigra single cell data
Human substantia nigra single cell data (Agarwal et al., 2020) was downloaded from GEO (GSE140231) and preprocessed with Seurat package version 2.3.0 (Butler et al., 2018) according to the original publication. Cell markers for each cluster compared to all the others were identified from transcripts detected in at least 30% of the available cells, and a log Fold Change higher than 0.5 using the function FindAllMarkers. The clusters were broadly annotated into six cell types using well-known markers: GFAP and GINS3 for the astrocytes; MOBP and MOG for the oligodendrocytes; CSF1R and OLR1 for the microglia; GAD1, GAD2, GABRA, and TH for the neurons; VCAN for the oligodendrocyte precursor cells; and LGALS1 and RGS5 for the endothelial cells.

Cell-type proportions estimation and marker selection
Cell type proportions were estimated by deconvolution using the function MDeconv from the TOAST package version 1.1.2 (Li et al., 2019;Li et al., 2020). We considered six brain cell types: astrocytes, endothelial cells, neurons, microglia, ODCs and OPCs. As cell markers, we initially tested 1) the set of markers identified from the substantia nigra single cell data (GSE140231), and 2) the set of 5,500 markers obtained from the Brain Cell Type Specific Gene Expression Analysis package version 1.0.0 (BRETIGEA, McKenzie et al., 2018). To define which of the two sets, the number, and identity of the markers to employ, we adopted a quality-based heuristic selection. For each marker set, each cell type and study, we ranked the markers on the basis of their association to the 1 st principal component. Next, we calculated the Pearson's correlation among the top 100, and iteratively excluded those showing negative correlations with the majority of the others until no markers with opposite correlations were left. Finally, we selected the top 20 markers for each cell type. We decided to use the BRETIGEA-derived markers in the rest of the analyses since they resolved better into clusters in the correlation matrices (Extended Data Figure 1-2).

Cell-types proportions comparison and statistical analyses
For each study and cell type, statistically significant differences in the estimates of a cell type proportion between the CTRL and the PD substantia nigra were assessed with a linear model controlling for the other annotated covariates, i.e. ProportionCellType~Status + Gender + Age + BraakStaging. We also conducted a meta-analysis to verify the alterations in proportion across the studies. To this aim, a random-effects meta-analysis was conducted in metafor version 2.1.0 (Viechtbauer et al., 2010). For each cell type, the standardized mean difference was calculated, and the associated p-values were corrected by the BH method.

Gene set enrichment analysis
Gene set enrichment analysis (GSEA) was performed using the Fast gene set enrichment analysis package version 1.12.0 (fgsea, Korotkevich et al., 2019), 100,000 permutations, with the Gene Ontology (GO) and the Canonical data set downloaded from the Molecular Signatures Database (MSigDB). For the GSEA on the gene expression, the genes in our data set were first ranked in descending order by the negative logarithm in base 10 of the adjusted p-values multiplied for the sign of the effect size. For the GSEA of the PPI network, nodes were ranked by the betweenness centrality (Zito et al., 2021). All p-values were corrected using the BH method.

Expression-weighted cell-type enrichment (EWCE)
Gene set enrichment for specific cell types was done using the expression-weighted cell-type enrichment (EWCE, Skene and Grant, 2016) package version 0.99.2. To this aim, we first calculated the average expression matrix for each cell type in the substantia nigra using the AverageExpression from the Seurat package using the GSE140231 data set. For each tested list, 10.000 randomly sampled (equal sized) gene sets from all genes in the average expression matrix were used to calculate the p-values, then adjusted by the BH correction.

Alterations in composition of the substantia nigra are heterogeneous
In total, 70 control (CTRL) and 88 PD transcriptomes were put together in our study. After imputing the gender for all the samples (Methods), there were 62 females (27 CTRL,35 PD) and 96 males (43 CTRL, 53 PD), with an average age of death of 75 years (Table 1). To capture the PD induced cyto-architectural alterations, we estimated the proportions of six cell types (astrocytes, endothelial cells, neurons, microglia, ODCs, and OPCs) from the bulk data using computational deconvolution (Methods). This revealed cell type heterogeneity; not only between the two groups, i.e. CTRL and PD patients, but also within each group (across samples), as well as between data sets ( Figure 2A, Extended Data Table 2-1).
Although we observed similar trends between CTRL and PD brains, the alterations in cell proportions did not consistently replicate across the analyzed data sets. To identify consistent alterations, shared among data sets, we conducted a meta-analysis of the estimated cell type proportions. This showed a significant increase in endothelial cells and oligodendrocytes as opposed to a decrease in the neuronal estimates ( Figure 2B, Extended Data Table 2-2). The change in neurons was the strongest, followed by the ODCs and then endothelial cells.

GSEA reveals that the majority of the altered pathways in PD are down-regulated
To explore the effect of the PD-related gene expression deregulation, we conducted a gene set enrichment analysis (

PPI partners of DEGs are enriched for genes expressed in endothelial cells and neurons
To identify the interaction partners of the proteins encoded by the DEGs, we constructed a protein-protein interaction (PPI) network based on the BioPlex, Biogrid, HuRI, IID, IntAct, and String databases (Methods). Starting with the 100 DEGs, we obtained a network comprised of 5,615 vertices and 92,035 edges (Methods, Extended Data Table 3-1). We assessed the expression-weighted cell-type enrichment with EWCE of the entire network and found a significant enrichment for endothelial cells and neurons ( Figure 3A, Extended Data Table 3-2). Enriched pathways (using GSEA) were secreted factors, G-protein-coupled receptors, amine transport, catecholamine secretion, neuronal system, and hematopoietic cell lineage differentiation ( Figure 3B, Extended Data Table 3-3). Moreover, this network was significantly enriched in PD-causing genes (6 genes in the network out of 6, FET p-value 4.6*10 -4 ) and in genes in proximity of the GWAS risk variants (36 genes in the network out of 97, FET p-value 0.029).

Central proteins of DEG PPI partners are known PD-causing genes and genes in proximity to PD risk factors
Within the PPI network of DEG interaction partners, we identified 198 nodes that have both a degree centrality higher than the 95 th percentile (276 nodes) as well as a betweenness centrality higher than the 95 th percentile (281 nodes). Among these, 11 (6%) were encoded by DEGs, hence biased towards higher centrality. The remaining 187 central nodes were defined as top central proteins ( Figure 3C, Extended Data Table 3-1). EWCE analysis revealed that these genes are enriched for endothelial cells ( Figure 3A, Extended Data Table 3-2). The top proteins were also enriched in PD-causing genes (

Discussion
We analyzed the influence of cyto-architectural alterations on the transcriptomic signals from human substantia nigra microarray data from PD patients and CTRL. We demonstrated that a broad palette of alterations in cell composition was present within and between strata as well as across studies. Specifically, a significant decrease of neurons and increase of ODCs in PD with respect to CTRL were the most consistent, but also differences for microglia and OPCs were found. The lack of a universal alteration pattern might be Estimated cell-type proportions were supported by previous and independent studies confirming the reliability of the deconvolution strategy. Firstly, the observed loss of dopaminergic neurons in substantia nigra is a pathological hallmark of PD. Secondly, our observation that endothelial cells were increased in the substatia nigra of PD patients is in line with previous reports (Faucheux et al., 1999;Desai Bradaric et al., 2012), although, it is still unclear whether endothelial cell expansion is a result or a driver of the inflammation status. Indeed, angiogenesis can be stimulated by molecules secreted by astrocytes and microglia in the reactive status (Naldini and Carraro, 2005;Wada et al., 2006) in a vicious loop (Barcia et al., 2004). Thirdly, the increase in microglia in one of the studies can be related to reactive gliosis present in PD that is known to implicate this cell type (Vila et al., 2001). Finally, the observed increase in OPCs and ODCs reflects the skewed neuron/oligodendrocyte ratio in the dissected samples due to the neuronal death. Kelly and Rose, 2010). Importantly, accounting for the cyto-architecture also showed that several of the usually reported pathways (e.g. Parkinson's disease pathway, Oxidative Phosphorylation, alpha-synuclein pathway) might be driven, at least in part, by changes in cell composition rather than the pathological status. A similar observation has recently been reported also in PD prefrontal cortex (Nido et al, 2020), reinforcing the contention that cyto-architecture is an important covariate with a major impact on our ability to understand transcriptional changes in bulk transcriptomics.
As interacting partners of DEGs can have consequent altered functionality, we constructed a network of protein-protein interactors with the detected DEGs. Cell type enrichment analysis of this network corroborated that gene deregulation might have an impact on neuronal biology also through the interacting partners of the DEGs. Further, it showed ramifications in endothelial cell processes unidentified by the differential expression analysis. Similarly, the GSEA reinforced the involvement of the neurons and pinpointed to previously undetected immune response terms. Importantly, PD-causing and genes proximal to GWAS variants, albeit not being differentially expressed, were enriched in this network.
This convergent evidence shows how genetically-identified genes can have an impact on the transcriptional landscape of substantia nigra even when not differentially expressed and supports the relevance of this network. Exploiting their topological characteristics, we nominated key proteins that were central in this partner network. Some of the central proteins have been indeed implicated in the pathogenesis of PD by independent lines of research, such as genes whose variants are causative of inherited forms of PD (LRRK2, PINK1, PRKN) (Kitada et al., 1998, Tolosa et al., 2020, Valente et al., 2004, and/or are the closest to risk variants for developing PD (BAG3, FYN, LRRK2) (Nalls et al., 2019). Furthermore, other hits with compelling evidence supporting their role in neurodegeneration (e.g. GSK3β, WWOX, and VPC;Chan et al., 2012, Wang et al., 2012 corroborate that our approach can be used to identify new potential players in the PD pathogenesis (e.g. NTRK1, TRIM25, ELAVL1).
Our cell-type composition aware meta-analysis comes with some limitations. Firstly, the deconvolution step does not take into account the differences in cell size or in RNA content of the various cell types potentially leading to systematic errors (Zaitsev et al., 2019).
Secondly, as the sum of the estimates of the cell-type proportions is constrained to be 100% for each sample, an increase or decrease in any of the cell types thus has to result in an equal but opposite alteration in at least one of the other cell types. Thirdly, the incomplete annotation of the samples across the studies prevented us from exploring the effect of the age, age of onset, pathology progression, genotype, and Braak staging on the cell proportions and transcriptional processes, all known to influence PD severity (Pagano et al., scRNAseq/snRNAseq studies of human substantia nigra of PD patients will allow additional steps forward in this area of research. In conclusion, our meta-analysis of bulk transcriptomics gives an updated view of the transcriptional landscape in the substantia nigra of PD patients. In addition to leverage a big number of studies and samples similarly to previous works (Zheng et al., 2010;Wang et al., 2019a), we accounted also for the dramatic cyto-architectural alteration induced by PD in this brain area uncovering its effects on downstream analyses.