Abstract
Age is the primary risk factor for many common human diseases including heart disease, Alzheimer’s dementias, cancers, and diabetes. Determining how and why tissues age differently is key to understanding the onset and progression of such pathologies. Here, we set out to quantify the relative contributions of genetics and aging to gene expression patterns from data collected across 27 tissues from 948 humans. We show that age impacts the predictive power of expression quantitative trait loci across several tissues. Jointly modelling the contributions of age and genetics to transcript level variation, we find that the heritability (h2) of gene expression is largely consistent among tissues. In contrast, the average contribution of aging to gene expression variance varied by more than 20-fold among tissues, with the average contribution of age exceeding the average h2 in 5 tissues. We find that the coordinated decline of mitochondrial and translation factors is a widespread signature of aging across tissues. Finally, we show that while in general the force of purifying selection is stronger on genes expressed early in life compared to late in life, as predicted by Medawar’s hypothesis, a handful of highly proliferative tissues exhibit the opposite pattern. These non-Medawarian tissues exhibit high rates of cancer and age-of-expression associated somatic mutations in cancer. In contrast, gene expression variation that is under genetic control is strongly enriched for genes under relaxed constraint. Together we present a novel framework for predicting gene expression phenotypes from genetics and age and provide insights into the tissue-specific relative contributions of genes and the environment to phenotypes of aging.
Introduction
Organismal survival requires molecular processes to be carried out with the utmost precision. However, as individuals age many biological processes deteriorate resulting in impaired function and disease. Such increases in the overall variance of molecular processes are predicted by Medawar’s germline mutation accumulation theory (1), which states that because older individuals are less likely to contribute their genetic information to the next generation, there is reduced selection to eliminate deleterious phenotypes that appear late in life (2). This theory also predicts that genes expressed early in life should be under increased selective constraint compared to genes expressed late in life. However, a key challenge remains in both quantifying age-associated changes in biological processes across tissues and identifying how genetic variation influences such changes.
At the organismal level, age-associated changes in the heterogeneity of gene expression between individuals have been observed for a handful of genes in humans (3). In an analysis of gene expression in monozygotic (identical) twins, 42 genes showed age-associated differences in gene expression, suggesting a role for the environment in modulating gene expression with age (2, 3). Similarly, the number of genes with expression quantitative trait loci (eQTLs) detected from blood in 70 year olds declined by 4.7% when they were resampled at 80 years old (4). However, the extent of this phenomenon, both across genes and tissues, remains unclear (5). Age-associated increases in the heterogeneity of gene expression have also been observed at the level of individual cell-to-cell variation; however, only some cell types appear to be impacted (6). In a recent study of immune T-cells from young and aged individuals, no difference in cell-to-cell variability was observed in unstimulated cells, however, upon immune activation the older cells appeared more heterogeneous (7). It is not known why some cell-types and not others may be more likely to exhibit increased cellular variability.
The relationship between the age at which a specific gene is expressed and the force of purifying selection has also recently been explored across a number of species (8, 9). These analyses have broadly confirmed that, on average, genes expressed later in life are under less constraint compared to those expressed early in life. However, how these patterns vary across different tissues and are impacted by genetic variation has not been systematically explored.
Here we set out to understand how aging affects the molecular heterogeneity of gene expression and to model the relative impact of age and genetic variation on this phenotype across tissues. First, using gene expression data from 948 individuals in GTEx V8 (10) we show that age impacts the predictive power of eQTLs, however to varying extents across different tissues and in old and young individuals. Increases in between individual gene expression heterogeneity were associated with these reductions in eQTL power. Using a regularized linear model-based approach to jointly model the impact of both age and genetic variation on gene expression we find that while the average heritability of gene expression is consistent across tissues, the average contribution of age varies substantially. Furthermore, while the genetic regulation of gene expression is similar across tissues, age-associated changes in gene expression are highly tissue-specific in their action. We use this joint model to identify each gene’s age of expression and show that while in most tissues late-expressed genes do tend to be under more relaxed selective constraint, among a handful of highly proliferative tissues the opposite trend holds.
Results
Expression quantitative trait loci exhibit varying predictive power in old and young individuals across several different tissues
To gain insight into how gene regulatory programs might be impacted by aging we analyzed transcriptomic data collected across multiple tissues from 948 humans (GTEx version 8) (10). We hypothesized that aging might dampen the effect of expression quantitative trait loci (eQTLs) due to factors such as increased environmental variance or molecular infidelity (Fig. 1A). To test this hypothesis we first classified individuals into young and old age groups, conservatively grouping individuals below and above the median age (55 years old, Fig. S1), respectively, and restricting our analyses to tissues with at least 100 individuals in both groups (27 tissues in total, Fig. S2, Table S1). In each tissue we down-sampled to match the sample size of old and young individuals while additionally controlling for covariates such as ancestry and technical confounders (Methods). Of note, a common approach to controlling for unobserved confounders in large gene expression experiments is to probabilistically infer hidden factors using statistical tools such as PEER (11). We noticed that many of the GTEx PEER factors were significantly correlated with sample age, with the top three correlated PEER factors having a Pearson r of 0.33, −0.21, and −0.15 (Fig. S3). To prevent loss of age-related variation, we recalculated a corrected set of PEER factors that were independent of sample age (Methods). We then assessed the significance of GTEx eQTLs in the young and old cohorts separately, comparing the distribution of P-values over all genes between old and young individuals (Fig. 1A, 1B). In 20 of the 27 assessed tissues (74%), the P-value distribution was significantly different between young and old individuals, with genotypes more predictive of expression in younger individuals in 12/20 cases (Fig. 1C). These results were largely identical when the analyses were performed with the original non-corrected PEER factors (18/27 tissues, Fig. S4).
While the GTEx dataset is unique in its wide sampling of participant ages and tissues, we validated our observations in the PIVUS cohort, which includes blood tissue from individuals re-sampled at ages 70 and 80 (4). This study previously demonstrated a reduction in eQTL heritability with age, supporting our results. We confirmed using our approach that eQTLs were less predictive of gene expression in 80-year-olds compared to 70-year-olds (Fig. S5, S6). These results suggest that the predictive power of eQTLs is impacted by the sample age across the vast majority of tissues. Furthermore, this effect is more pronounced in older samples compared with younger samples.
Age-associated changes in gene expression heterogeneity impact gene expression heritability
We hypothesized that the overall reduced predictive power of eQTLs in some tissues might be in part due to an increase in expression heterogeneity in these tissues, potentially as a result of increased environmental variance. To test if such an effect would broadly affect expression across all genes in a tissue (Fig. 2A) we calculated the distribution of pairwise distances among individual’s tissue-specific gene expression profiles using the Jensen-Shannon Divergence (JSD) (12, 13) as a distance metric. The JSD is a robust distance which is less impacted by outliers compared to other methods (e.g. Euclidean distance) (13). Comparing the distribution of pairwise differences in transcriptional profiles within distinct age groups allows us to determine if gene expression signatures are more similar among younger individuals or among older individuals.
We compared the mean difference in gene expression distances among old and young individuals as well as the slope of the inter-individual JSD when grouping individuals into six bins spanning 20-80 years old (see Methods, Fig. 2B, 2C). These two strategies yielded highly similar results (Fig. 2C, R=0.8), identifying tissues exhibiting increased heterogeneity in both young and old populations (Fig. S7). The difference in JSD between old and young individuals was also negatively correlated with the results from our analysis of eQTLs across old and young individuals (Fig. S8, R=-0.48, P=0.01), highlighting that tissues with increases in inter-individual heterogeneity were likely to also exhibit reductions in the proportion of variance described by eQTLs.
To expand our eQTL analyses to account for the combined impact of nearby SNPs, we utilized the multi-SNP regularized linear model of PrediXcan (14). This model has the benefit of combining genetic effects across many loci, instead of examining just a single eQTL variant. This combined genetic contribution to gene expression variance results in an estimate of the heritability (h2) for each gene. We applied this model independently in old and young individuals to quantify h2 and found that the average per-gene difference in h2 between old and young individuals was strongly negatively correlated with the difference in JSD between samples (R=0.6, P=9.9e-4, Fig. 2D, Fig. S9). To verify these results we again referred to the PIVUS study and obtained cis heritability estimates using the GCTA package (15). As expected, we again observed that the heritability of gene expression decreases with age, corresponding with the PrediXcan results in GTEx whole blood (Fig. S10). Together these results suggest that across numerous tissues gene expression heterogeneity differs between young and old individuals, and that this increased expression variance drives a reduction in the average heritability of gene expression across these tissues.
We additionally sought to identify individual genes exhibiting age-associated expression heterogeneity by testing if, after regressing out age-related changes in gene expression levels, the variance of the residuals correlated with age (Breusch-Pagan test). The effect size from this test (βhet) describes the strength and direction of the age related changes in gene expression variance. Using this approach we identified 273 genes with age-associated variance changes (FDR<0.05) across tissues (Fig S11). The estimated βhet values in these genes were overwhelmingly negative (234/279, 84%, Table S3) indicating that the dominant signature was of reduced gene expression heterogeneity with age. A Gene Set Enrichment Analysis (GSEA) of these genes highlighted pathways involved in metabolism, cell proliferation, cell cycle and cell death (Fig. S12). While the proportion of positively heteroskedastic genes was weakly correlated with the transcriptome-wide JSD (p=1.32e-2, Fig. S13), the small number of genes implicated suggests that these metrics are capturing different phenomena.
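The per-gene heteroskedasticity test described above can be illustrated with a minimal sketch. It assumes the studentized (Koenker) form of the Breusch-Pagan test with age as the only regressor; the text does not state which variant or covariates were used, and the function names and the use of the auxiliary-regression slope as βhet are assumptions made for illustration only.

```python
import math

def ols(x, y):
    """Simple least squares for y = intercept + slope * x; returns slope and fitted values."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    return slope, [intercept + slope * xi for xi in x]

def breusch_pagan_age(expression, ages):
    """Return (beta_het, LM statistic, p-value) for one gene.

    Step 1: regress expression on age and keep the residuals.
    Step 2: regress the squared residuals on age (auxiliary regression).
    Step 3: LM = n * R^2 of the auxiliary regression, compared to chi-square(1).
    """
    _, fitted = ols(ages, expression)
    sq_resid = [(y - f) ** 2 for y, f in zip(expression, fitted)]
    beta_het, fitted_sq = ols(ages, sq_resid)  # assumed definition of the effect size
    mean_sq = sum(sq_resid) / len(sq_resid)
    ss_tot = sum((s - mean_sq) ** 2 for s in sq_resid)
    ss_res = sum((s - f) ** 2 for s, f in zip(sq_resid, fitted_sq))
    r2 = 1 - ss_res / ss_tot if ss_tot > 0 else 0.0
    lm = max(len(ages) * r2, 0.0)  # clamp tiny negative rounding error
    p_value = math.erfc(math.sqrt(lm / 2))  # chi-square survival function, 1 df
    return beta_het, lm, p_value
```

A positive `beta_het` indicates residual variance that grows with age and a negative value indicates variance that shrinks with age; in practice the resulting p-values would still need multiple-testing correction across genes.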
Cell-type specific age-associated changes in gene expression heterogeneity and the predictive power of eQTLs
While no datasets of the magnitude and scale of GTEx exist for single-cell genomic data, we employed the tool CIBERSORTx (16) to deconvolute bulk GTEx blood RNA-seq data into cell-type specific abundances. Assessing the predictive power of eQTLs in old and young individuals in six immune cell subtypes, we again found significantly increased explanatory power of eQTLs in younger individuals compared to older individuals (Fig. S14). Consistent with these analyses, a comparison of the JSD in old and young individuals revealed increased expression heterogeneity across these cell types (Fig. S15). We also investigated whether the observed differences in eQTL power and expression heterogeneity might be driven by changes in cell-type composition; however, cell-type composition changes were not reflective of gene expression variance (P=0.2, Fig. S16), suggesting that age-associated changes in eQTL power and expression heterogeneity are taking place at the transcript level.
Jointly modeling the impact of age and genetics on gene expression identifies distinct, tissue-specific patterns of aging
A more powerful approach to understand how both genetics and age impact gene expression variation is to model these factors jointly. We set out to extend the regularized linear model employed by PrediXcan (14) to incorporate an age factor (Fig. 3A), allowing us to parse apart the individual contributions of genetics (h2), age, and the environment to the expression variance of each gene (e.g. Fig. 3B, Fig. 3C, Fig. S17). We define the environmental component as all sources of variation not captured by genetics or age. Estimates of h2 in our extended model were highly consistent with those in the original PrediXcan approach (Fig. S18).
Employing our model across each tissue independently, we find that the average heritability of gene expression was largely consistent among tissues, ranging from 2.9%-5.7%, with 40% of genes having an h2>10% in at least one tissue (Fig. 3D, S19). Thus, while the variation in expression of many individual genes is strongly influenced by genetics, on average, genetics explains a small proportion of overall gene expression variation. In contrast, the average contribution of aging to gene expression varied more than 20-fold among tissues, from 0.4%-7.9%, with the average greater than the average h2 in 5 tissues. Among these 5 tissues the expression of 39-54% of genes was more influenced by age than by genetics (i.e. an age contribution greater than h2, Fig. S20) and across all tissues 45% of genes had an in at least one tissue. Assessing the tissue-specificity of these trends on a per-gene basis, we found that while the estimated heritability of gene expression tended to be similar among different tissues, the age-associated component exhibited significantly more tissue specificity (P<2.2e-16, Fig. 3E). We note that the widespread signatures of age-associated gene expression variance that we identify are virtually undetectable when using the GTEx-provided PEER factors. Just 1.84% of the age-associated genes we identify have a nonzero age coefficient when using these GTEx PEER factors (Fig. S21). We tested if sex-specific age effects were contributing to the observed age associations, as might be expected if changes related to menopause were playing a role (Fig. S22). Including an interaction term between age and sex in our joint model, we found that while the age term continued to describe a large proportion of the variance (on average 2.6%), the contribution of the age-sex interaction term was several-fold lower (average of 0.035%, Fig. S23, Table S4). The model incorporating age-sex interactions also showed consistent estimates of variance explained as compared to the baseline joint model (R=0.99, Fig. S24).
Our model thus widely expands the utility of the GTEx dataset and exploration of critical biological signatures of aging. Together these results imply that age-associated patterns of gene expression exhibit substantially more tissue specificity than those that are influenced by genetics and among several tissues age plays a much stronger role in driving gene expression patterns than genetics.
Coordinated decline of mitochondrial and translation factors is a widespread signature of aging across tissues
To understand the underlying biological implications of age-associated gene expression changes we applied gene set enrichment analysis (GSEA) (17) to each tissue independently, ranking genes either by the relative contribution of genetics (h2) or by the relative contribution of aging. Comparing the distribution of P-values from enriched GO annotations, we found that pathways enriched for age-associated variance reached substantially stronger significance than pathways enriched for genetic-associated variance (e.g. Fig. 4A). We found more age-associated pathway enrichment even in tissues for which the average age-associated contribution to gene expression was low (e.g. Pancreas, Fig. S25). This implies that while age-associated changes in gene expression vary widely in their magnitude among tissues, these changes consistently impact critical biological processes. A GSEA of genes ranked by the tissue-averaged slope of the age-associated trend (βage) highlighted several key aging-associated pathways. Pathways associated with various mitochondrial and metabolic processes and translation were enriched for negative βage values, implying age-associated decreases (Fig. 4B). A single immune pathway, the interferon-gamma response, was enriched for positive βage values (Fig. 4B). An additional 18 immune pathways were identified as having age-associated increases in gene expression using a more lenient significance threshold (FDR<0.05) (Fig. S26, Table S5). In contrast, no pathways were significantly enriched when genes were ranked by average h2.
To further explore the functional impact of age-associated gene expression changes we compared the age-associated variance of all nuclear-encoded mitochondrial genes (n=1120, (18)), and of translation initiation, elongation, and termination factors, across tissues (Fig. 4C, Fig. S27). Genes in these pathways were exceptionally enriched for age-associated gene expression across several tissues. In some cases >10% of the average expression variation of mitochondrial or translation factor genes could be explained by age. βage was consistently negative in these mitochondrial and translation factor genes (Fig. 4D), highlighting that genes in these pathways exhibit a systematic decrease in expression as a function of age. Overall, across tissues an average of 36% of all mitochondrial genes (406/1120) and 35% of translation factors (119/337) exhibited age-associated declines; however, in some tissues these proportions exceeded 60%. In contrast, the only pathway associated with age-associated increases in expression, interferon-gamma response genes, was largely specific to blood and arterial tissue (Fig. 4C), likely due to the role of this pathway in immune cells. Together these results demonstrate that the coordinated decline of mitochondrial genes and translation factors is a widespread phenomenon of aging across several tissues with potential phenotypic consequences.
Distinct evolutionary signatures of gene expression patterns influenced by aging and genetics
Evolutionary theory predicts that due to the increased impact of selection in younger individuals, genes that increase as a function of age (βage > 0) should be under reduced selective constraint compared to genes that are highly expressed in young individuals (βage < 0), a theory of aging known as Medawar’s hypothesis (1) (Fig. 5A). Several recent studies have demonstrated the generality of this trend across species (8, 9, 19); however, the tissue-specificity of this theoretical prediction has not been explored. We sought to test the generality of this trend across different tissues by comparing βage with the level of constraint on genes, quantified as the probability of loss-of-function intolerance (pLI) score from gnomAD (20). As expected, across the vast majority of tissues βage was significantly negatively correlated with pLI (Fig. 5B, 5C, S28), in line with Medawar’s hypothesis. However, five tissues exhibited significant signatures in the opposite direction, including prostate, transverse colon, breast, whole blood, and lung tissue (P < 1e-3). These five tissues still maintained a significant negative relationship after subsetting to genes that are highly dependent on age (Fig. S29). These tissues with non-Medawarian trends are driven by highly constrained, functionally important genes being expressed at a higher rate in older individuals (Fig. S30). Using dN/dS (21) as an alternative metric of gene constraint yielded highly correlated results (R=-0.72, P=2.5e-5, Fig. S31, S32).
To explore why these five tissues might exhibit distinctive evolutionary signatures of aging we compared the distribution of significant βage parameters between Medawarian and non-Medawarian tissues among different hallmark pathways (22). We found 11 signatures exhibiting significantly increased βage (FDR<0.01) compared to non-Medawarian tissues (Fig. 5D, S33), most prominently DNA-damage, TGF-β signalling, MYC targets, and epithelial-to-mesenchymal transition pathways. All of these signatures are broadly correlated with cellular proliferation, differentiation, and cancer. Indeed, these five non-Medawarian tissues are also the top five most commonly diagnosed sites for cancer in 2022 (23) (Fig. 5E). To directly investigate cancer signatures in these tissues we quantified the per-gene likelihood of having somatic mutations in tumors using the COSMIC cancer browser (24). GTEx tissues were matched to the most representative cancer types for comparisons (e.g. Breast Cancer → Breast Mammary Tissue, Table S6). We found that the per-gene age of expression (page) was significantly correlated with mutation frequency (i.e. mutational burden) across several tissues (Fig. 5F, S34), with the 5 non-Medawarian tissues exhibiting some of the strongest signatures (P<1e-4). These results highlight that gene expression patterns in tissues and cell types that proliferate throughout the course of an individual’s life may be subjected to distinct evolutionary pressures, with important implications for the cancer susceptibility of these tissues.
We also explored the relationship between gene expression heritability and constraint. Across almost all tissues h2 was significantly negatively correlated with pLI (26/27 tissues, P-value < 1e-3) (Fig. S35, S36). While this trend was consistent across tissues, intriguingly it was strongest in heart tissues. The exception was liver, which also had the highest average among all tissues, which was only nominally significant after multiple test correction (P<0.00185). These results indicate that genes in which the variance in expression is heritable tend to be under significantly less functional constraint. In contrast, highly conserved genes that are intolerant to mutation are significantly less likely to exhibit heritable variation in gene expression, likely because their expression levels are additionally under constraint.
Discussion
Studying age-associated changes in gene expression provides critical insights into the underlying biological processes of aging. Here, we set out to quantify the relative contributions of aging and genetics to gene expression phenotypes across different human tissues. Our study finds that the predictive power of eQTLs is significantly impacted by age across several different tissues and that this effect is more pronounced in older individuals. These results extend previous work examining blood tissue (4) and highlight the varied impact of aging on eQTLs among different tissues. We show that this result is likely to be in part due to an increase in the inter-individual heterogeneity of gene expression patterns in some contexts, potentially as a result of the increased impact of the environment. It is also of note that increased inter-individual heterogeneity in both younger and older individuals was associated with reduced predictive power of eQTLs. However, our study is limited in its primary focus on bulk-tissue transcriptomic data. Early evidence from single-cell studies already suggests that differences in gene expression heterogeneity vary among cell types of tissues as a function of age (6, 7, 25, 26). While these studies lack sufficient individual sample sizes and genetic diversity for the statistical approaches used herein, it is possible that in the future the availability of larger datasets will facilitate studying these phenomena at the single-cell level. The extensive tissue heterogeneity we observe suggests that patterns of aging will exhibit substantial cell-type specificity.
We also present a novel approach to jointly model the impact of genetics and aging on gene expression variance to parse out the individual contributions of each of these factors. The increased complexity of our model has little impact on its accuracy, with our expression heritability estimates strongly correlated with previous heritability measures across all tissues (mean Pearson’s r=0.89, Fig. S18). Using this model we show that age exhibits exceptionally varied effects on different tissues, and indeed, in several tissues age contributes more to gene expression variance on average than genetics. These results also highlight a widespread coordinated signature of age-associated decline in mitochondrial and translation factors. Dysregulation in mitochondrial function and ribosome biogenesis have been documented as key players in aging (27, 28); however, our results highlight the tissue specificity of these trends. Our model also allows us to quantify the tissue-specific evolutionary context of age-associated gene expression changes. We corroborate the inverse relationship between age-at-expression and constraint, as predicted by Medawar’s hypothesis and recently documented by others (8, 9, 19), across the vast majority of tissues. However, we also surprisingly identify five tissues which exhibit the opposite pattern and show that age-associated signatures of increased proliferation and cancer are enriched in these tissues. These results highlight the distinct evolutionary forces that act on late-acting genes expressed in highly proliferative cell types. Future work extending these analyses to the single-cell level will provide further insights into the cell-type-specific age-associated patterns of constraint, and their relevance to cancer.
Overall this work has several important implications. Our results shed light on recent work on the prediction accuracy of polygenic risk scores (PRS) (29) which found that numerous factors, including age, sex, and socioeconomic status can profoundly impact the prediction accuracy of such scores even in individuals with the same genetic ancestry. Our results highlight that genetics exhibit varied predictive power in several different tissues as a function of age, potentially playing a role in differential PRS accuracy between young and old individuals. This also has important implications for disease association and prediction approaches that leverage expression quantitative trait loci to prioritize variants, including colocalization methods (30), transcriptome-wide association studies (14, 31), and Mendelian randomization (32, 33). If a significant proportion of eQTLs exhibit age-associated biases in their effect size in a tissue of interest, then these approaches may be less powerful when applied to diseases for which age is a primary risk factor such as heart disease, Alzheimer’s dementias, cancers, and diabetes.
The critical role of aging as a risk factor for many common human diseases underscores the importance of understanding its impact on cellular systems at the molecular level. Together our analyses provide novel insights into tissue-specific patterns of aging and the relative impact of genetics and aging on gene expression. We anticipate that future studies across tissues and cells of gene expression, chromatin structure, and epigenetics will further elucidate how both programmed and stochastic processes of aging drive human disease.
Supplementary Note 1: Methods
Data collection and age groupings
We downloaded gene expression data for multiple individuals and tissues from GTEx V8 (10), which were previously aligned and processed against the hg19 human genome. Tissues were included in the analysis if they had >100 individuals in both the age ≥ 55 and <55 cohorts described below (Fig. S2). For a given tissue, genes were included if they had >0.1 TPM in ≥ 20% of samples and ≥ 6 reads in ≥ 20% of samples, following GTEx’s eQTL analysis pipeline. To compare gene expression heritability across individuals of different ages, for some analyses we split the GTEx data for each tissue into two age groups, “young” and “old,” based on the median age of individuals in the full dataset, which was 55 (Fig. S1). Within each tissue dataset, we then equalized the number of individuals in the young and old groups by randomly downsampling the larger group, to ensure that our models were equally powered for the two age groups.
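The gene filter and the balanced young/old split described above can be sketched as follows. This is an illustration only, not the GTEx pipeline itself: the function names, the dictionary input, and the fixed random seed are hypothetical, while the thresholds (0.1 TPM, 6 reads, 20% of samples, cutoff at age 55) are taken from the text.

```python
import random

def passes_expression_filter(tpm, reads, min_tpm=0.1, min_reads=6, min_frac=0.2):
    """Keep a gene if >0.1 TPM in >=20% of samples and >=6 reads in >=20% of samples."""
    n = len(tpm)
    frac_tpm = sum(t > min_tpm for t in tpm) / n
    frac_reads = sum(r >= min_reads for r in reads) / n
    return frac_tpm >= min_frac and frac_reads >= min_frac

def split_and_downsample(ages_by_id, cutoff=55, seed=0):
    """Split individuals into young (<cutoff) and old (>=cutoff) groups of equal size.

    The larger group is randomly downsampled to the size of the smaller one.
    """
    young = sorted(i for i, age in ages_by_id.items() if age < cutoff)
    old = sorted(i for i, age in ages_by_id.items() if age >= cutoff)
    n = min(len(young), len(old))
    rng = random.Random(seed)  # fixed seed so the toy example is reproducible
    return rng.sample(young, n), rng.sample(old, n)
```

For example, `split_and_downsample({"a": 30, "b": 40, "c": 50, "d": 60, "e": 70})` returns two lists of two individuals each, since the old group has only two members.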
PEER factor analysis
We analyzed existing precomputed PEER factors available from GTEx to check for correlations between these hidden covariates and age. In particular, we fit a linear regression between age and each hidden covariate and identified significant age correlations using an F-statistic (Fig. S3). Because some of the covariates were correlated with age, we generated new age-independent hidden covariates of gene expression to remove batch and other confounding effects on gene expression while retaining age related variation. In particular, we first removed age contributions to gene expression by regressing gene expression on age and then ran PEER on the age-independent residual gene expression to generate 15 age-independent hidden PEER factors.
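The first step of this procedure, removing the age contribution from each gene before hidden-factor inference, can be illustrated with a minimal sketch. It assumes a simple per-gene linear regression on age with an intercept; the PEER step itself is not shown, and the function names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def residualize_on_age(expression, ages):
    """Return the expression of one gene with its linear age trend removed."""
    intercept, slope = fit_line(ages, expression)
    return [e - (intercept + slope * a) for e, a in zip(expression, ages)]

def residualize_matrix(expression_by_gene, ages):
    """Apply the age regression gene by gene; the output would be the input to PEER."""
    return {gene: residualize_on_age(values, ages)
            for gene, values in expression_by_gene.items()}
```

By construction the residuals are uncorrelated with age, so hidden factors inferred from them cannot absorb linear age effects, which is the property the age-independent PEER factors are meant to have.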
Quantifying the effect of eQTLs on gene expression in different age groups
Using the binary age groups defined above, we assessed the relative significance of eQTLs in old and young individuals by carrying out separate assessment of eQTLs identified by GTEx. We report the number of genes included in analysis for each tissue (Table S7). For each gene in each tissue and each age group, we regressed the GTEx pre-normalized expression levels on the genotype of the lead SNP (identified by GTEx, MAF>0.01) using 5 PCs, 15 PEER factors, sex, PCR protocol and sequencing platform as covariates, following the GTEx best practices. We confirmed our results using both our recomputed PEER factors as well as the PEER factors provided by GTEx (Fig. S4). To test for significant differences in genetic associations with gene expression between the old and young age groups, we compared the p-value distributions between these groups for all genes and all SNPs in a given tissue using Welch’s t-test. To investigate the validity of the age cutoff used for these binary age groups, we replicated the eQTL analysis using two additional age cutoffs of 45 and 65 years old. We observed the same trends in both cases; however, statistical power decreased due to smaller sample sizes in the resulting age bins, leading to a non-significant result for age cutoff 45 (Fig. S37).
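The comparison of association strength between age groups can be sketched with Welch's unequal-variance t statistic. The sketch below takes two lists of per-gene values (for instance eQTL p-values in young and old individuals); whether the values were transformed before testing is not stated in the text, so they are used as given, and the function name is hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    va = variance(sample_a) / na  # squared standard error of each group mean
    vb = variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df
```

The two-sided p-value would then be obtained from Student's t distribution with `df` degrees of freedom, which is typically delegated to a statistics library rather than computed by hand.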
Jensen-Shannon Divergence as a distance metric between transcriptome profiles
To quantify differences in gene expression between individuals, we computed the pairwise distance for all pairs of individuals in an age group using the square root of the Jensen-Shannon Divergence (JSD) as a distance metric, which measures the similarity of two probability distributions. Here we applied JSD between pairs of individuals’ transcriptome vectors containing the gene expression values for each gene, which we converted to a distribution by normalizing by the sum of the entries in the vector. For two individuals’ transcriptome distributions, the JSD can be calculated as:
\[ \mathrm{JSD}(P_1, P_2) = H\!\left(\frac{P_1 + P_2}{2}\right) - \frac{H(P_1) + H(P_2)}{2} \]
where Pi is the distribution for individual i and H is the Shannon entropy function:
\[ H(P) = -\sum_{k} P(k) \log P(k) \]
with the sum taken over genes k.
JSD is a robust metric that is less sensitive to noise than traditional distance measures such as Euclidean distance and correlation. JSD and other approaches have been shown to yield similar results, but JSD is more robust to outliers (12). The square root of the raw JSD value satisfies the triangle inequality, enabling us to treat it as a distance metric.
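The distance described above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming base-2 logarithms (the text does not state the base) and treating zero entries as contributing zero entropy; the function names and example vectors are hypothetical.

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy in bits; zero-probability entries contribute 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

def jsd_distance(x, y):
    """Square root of the Jensen-Shannon divergence between two
    non-negative expression vectors, each normalized to sum to 1."""
    p = np.asarray(x, dtype=float)
    q = np.asarray(y, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)
    jsd = shannon_entropy(m) - 0.5 * (shannon_entropy(p) + shannon_entropy(q))
    return float(np.sqrt(max(jsd, 0.0)))  # guard against tiny negative round-off

# Two made-up transcriptome vectors over four genes.
expr_a = np.array([10.0, 0.0, 5.0, 85.0])
expr_b = np.array([12.0, 1.0, 4.0, 83.0])
d_ab = jsd_distance(expr_a, expr_b)
```

With base-2 logarithms the distance lies between 0 (identical distributions) and 1 (distributions with no shared support).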
Slope of JSD distance versus age
In addition to comparing JSD between the two age groups defined above, “young” and “old”, we also binned all GTEx individuals into 6 age groups, from 20 to 80 years old with an increment of 10 years. We then computed pairwise distance and average age for each pair of individuals within each bin using the square root of JSD as the distance metric. We applied a linear regression model of JSD versus age to obtain slopes, confidence intervals, and p-values.
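The binning and regression steps can be sketched as follows. This is an illustrative outline on synthetic data, not the authors' code: the function name `jsd_age_slope` and the half-open bin convention are assumptions, and `scipy.spatial.distance.jensenshannon` is used because it returns the square root of the divergence directly.

```python
import itertools
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import linregress

def jsd_age_slope(expr, ages, bin_edges=(20, 30, 40, 50, 60, 70, 80)):
    """Regress within-bin pairwise sqrt-JSD distance on the pair's mean age.
    expr: individuals x genes matrix of non-negative expression values."""
    expr = np.asarray(expr, dtype=float)
    ages = np.asarray(ages, dtype=float)
    mean_ages, dists = [], []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        idx = np.where((ages >= lo) & (ages < hi))[0]   # assumed [lo, hi) bins
        for i, j in itertools.combinations(idx, 2):
            mean_ages.append(0.5 * (ages[i] + ages[j]))
            dists.append(jensenshannon(expr[i], expr[j], base=2))
    return linregress(mean_ages, dists)

# Synthetic data: 60 individuals, 100 genes.
rng = np.random.default_rng(0)
ages = rng.uniform(20, 80, size=60)
expr = rng.gamma(shape=2.0, scale=1.0, size=(60, 100))
fit = jsd_age_slope(expr, ages)
# fit.slope, fit.stderr and fit.pvalue summarize the JSD-versus-age trend.
```

Because each individual appears in many pairs, the pairwise distances are not independent, so the nominal standard error from a simple regression like this one should be read with caution.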
Cell-type specific analysis
To analyze whether cell type composition affects age-associated expression changes, we used CIBERSORTx (16) to estimate cell type composition and individual cell type expression levels in GTEx whole blood. Cell type composition estimates were computed using CIBERSORTx regular mode. Individual cell type expression level estimates were computed using CIBERSORTx high resolution mode. We then repeated our JSD and eQTL analyses on each cell type independently (see JSD and eQTL sections for details). In addition, to analyze tissue-specific differences in cell type composition, we referred to a previous study (34) that computed cell type composition for different GTEx tissues using CIBERSORTx. We applied the JSD metric to each tissue, using the cell type composition vector as the distribution. Additionally, we applied the Breusch–Pagan test to compute heteroskedasticity coefficients and p-values with respect to age, after inverse logit transformation to give an approximately Gaussian distribution (Fig. S41) (see section on heteroskedastic gene expression).
Heteroskedastic gene expression
We used the Breusch–Pagan test to call heteroskedastic gene expression with age. For each gene and tissue, we computed gene expression residuals by regressing out age-correlated PEER factors, other GTEx covariates, and age. To test for age-related heteroskedasticity, we squared these residuals and divided by the mean, regressed them against age, and looked at the age effect size (βhet). We called significantly heteroskedastic genes using a t-test with the null hypothesis that the βhet is zero. The Benjamini-Hochberg procedure was used to control for false positives. To determine which tissues have more genes with increasing gene expression heterogeneity with age, we compare the number of genes with positive heteroskedasticity (βhet>0 and FDR<0.2) to the total of all heteroskedastic genes (FDR<0.2). We compare this metric to the per-tissue 2-bin JSD (Fig. S38) and 6-bin JSD slope (Fig. S13).
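A simplified version of this procedure is sketched below: squared residuals are scaled by their mean, regressed on age, and the resulting p-values are adjusted with the Benjamini-Hochberg procedure. The function names and simulated residuals are hypothetical, and the covariate regression that produces the residuals is assumed to have been done already.

```python
import numpy as np
from scipy.stats import linregress

def het_effect(residuals, ages):
    """Slope (beta_het) and t-test p-value from regressing
    mean-scaled squared residuals on age."""
    sq = np.asarray(residuals, dtype=float) ** 2
    scaled = sq / sq.mean()
    fit = linregress(ages, scaled)      # p-value tests slope == 0
    return fit.slope, fit.pvalue

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (FDR)."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]  # enforce monotonicity
    q = np.empty(n)
    q[order] = np.minimum(ranked, 1.0)
    return q

# Simulated residuals whose spread grows with age.
rng = np.random.default_rng(1)
ages = rng.uniform(20, 80, size=2000)
residuals = rng.normal(0.0, 1.0 + (ages - 20.0) / 20.0)
beta_het, p_het = het_effect(residuals, ages)

q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

A gene would then be counted as having increasing heterogeneity with age when its slope is positive and its adjusted p-value falls below the chosen FDR threshold.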
Multi-SNP gene expression prediction
We used a multi-SNP gene expression prediction model based on PrediXcan (14) to corroborate our findings from the eQTL and JSD analyses on the two age groups, “young” and “old”. For each gene in each tissue, we trained a multi-SNP model separately within each age group to predict individual-level gene expression. An individual’s gene expression level Y for a gene g and tissue t is modeled as:
$$Y_{g,t} = \sum_{i} \beta_{i,g,t} X_i + \epsilon$$
where $\beta_{i,g,t}$ is the coefficient or effect size for SNP $X_i$ in gene g and tissue t, and $\epsilon$ includes all other noise and environmental effects. The regularized linear model for each gene considers dosages of all common SNPs within 1 megabase of the gene’s TSS as input, where common SNPs are defined as MAF > 0.05 and Hardy-Weinberg equilibrium P > 0.05. We removed covariate effects on gene expression prior to model training by regressing out both GTEx covariates and age-independent PEER factors (described above). Coefficients were fit using an elastic net model, which solves the problem (35):
$$\min_{\beta} \; \frac{1}{2N} \lVert Y - X\beta \rVert_2^2 + \lambda \left( \alpha \lVert \beta \rVert_1 + \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2 \right)$$
where $N$ is the number of individuals in the training set.
The minimization problem contains both the error of our model predictions and a regularization term to prevent model overfitting. The elastic net regularization term incorporates both L1 ($\lVert\beta\rVert_1$) and L2 ($\lVert\beta\rVert_2^2$) penalties. Following PrediXcan, we weighted the L1 and L2 penalties equally using α = 0.5 (14). For each model, the regularization parameter λ was chosen via 10-fold cross validation. The elastic net models were fit using Python’s glmnet package and R2 was evaluated using scikit-learn. From the trained models for each gene, we evaluated the training set genetic R2 (or h2) for the two age groups and subtracted to get the difference in gene expression heritability between the groups. We compared this average difference in heritability to the mean JSDold – JSDyoung and log(Pold) – log(Pyoung) using P-values from the eQTL analyses across genes.
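The per-age-group training procedure can be approximated as below. This sketch swaps in scikit-learn's `ElasticNetCV` for the glmnet package named in the text; `l1_ratio=0.5` plays the role of α and `cv=10` selects λ by 10-fold cross validation. The simulated dosages, effect sizes, and helper names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import r2_score

def training_r2(dosages, expr):
    """Fit an elastic net (equal L1/L2 weighting, 10-fold CV for the
    penalty strength) and return the training-set R2."""
    model = ElasticNetCV(l1_ratio=0.5, cv=10)
    model.fit(dosages, expr)
    return float(r2_score(expr, model.predict(dosages)))

rng = np.random.default_rng(0)

def simulate_group(n, noise_sd):
    """Hypothetical cis-SNP dosages and covariate-adjusted expression."""
    dosages = rng.integers(0, 3, size=(n, 40)).astype(float)
    expr = (0.6 * dosages[:, 0] - 0.4 * dosages[:, 1]
            + rng.normal(scale=noise_sd, size=n))
    return dosages, expr

# Same genetic effects in both groups, more non-genetic noise in "old".
r2_young = training_r2(*simulate_group(200, noise_sd=1.0))
r2_old = training_r2(*simulate_group(200, noise_sd=2.0))
h2_diff = r2_old - r2_young   # difference in training-set genetic R2
```

Training-set R2 is optimistic relative to held-out performance, which is worth keeping in mind when comparing groups of different sample size.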
Joint model for expression prediction using SNPs and age
To uncover linear relationships between gene expression and both age and genetics, we built a set of gene expression prediction models using both common SNPs and standardized age as input. An individual’s gene expression level Y for a gene g and tissue t is modeled as:
$$Y_{g,t} = \sum_{i} \beta_{i,g,t} X_i + \beta_{\mathrm{age},g,t} A + \epsilon$$
where $A$ is the normalized age of an individual. Coefficients were fit using elastic net regularization, as above, which sets coefficients for non-informative predictors to zero. The sign of the fitted age coefficient (βage,g,t), when nonzero, reflects whether the gene in that tissue is expressed more in young (negative coefficient) or old (positive coefficient) individuals. We also evaluated the training set R2 using the fit model coefficients separately for genetics (across all SNPs in the model) and age:
We also tested whether the age-related gene expression relationship was sex-specific by rerunning the joint model with an additional age-sex interaction term as follows:
$$Y_{g,t} = \sum_{i} \beta_{i,g,t} X_i + \beta_{\mathrm{age},g,t} A + \beta_{\mathrm{age*sex},g,t} (A \cdot S) + \epsilon$$
where $\beta_{\mathrm{age*sex},g,t}$ is the additional model weight for the age-sex interaction term and $S$ is the binary sex of the GTEx individual. The R2 of age, genetics, and the age-sex interaction term are evaluated as before by determining the variance explained by each term in the model. We compared the R2 between the models including or excluding the age-sex interaction term (Fig. S24). We also compared the tissue-averaged variance explained by age and the age-sex interaction term. Finally, to check the consistency of tissue-specific gene expression heritability estimates from our model and the original PrediXcan model trained on GTEx data, we evaluated Pearson’s r between our heritability estimates and those of PrediXcan (Fig. S18), using heritability estimates from the original PrediXcan model available in PredictDB.
Tissue specificity of age and genetic associations
We evaluated the variability of age and genetic associations across tissues using a measure of tissue specificity for age and genetic R2 (36). We measured the tissue-specificity of a gene g’s variance explained using the following metric:
$$S_g = \frac{\sum_{t=1}^{n} \left( 1 - \frac{R^2_{g,t}}{R^2_{g,\max}} \right)}{n - 1}$$
where $n$ is the total number of tissues, $R^2_{g,t}$ is the variance explained by either age or genetics for the gene g in tissue t, and $R^2_{g,\max}$ is the maximum variance explained for g over all tissues. This metric can be thought of as the average reduction in variance explained relative to the maximum variance explained across tissues for a given gene. The metric ranges from 0 to 1, with 0 representing ubiquitously high genetic or age R2 and 1 representing only one tissue with nonzero genetic or age R2 for a given gene. We calculate Sg separately for age R2 and genetic R2 across all genes.
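A short worked example may help make the metric concrete. The sketch below assumes the metric takes the form described in the text (sum over tissues of one minus the ratio to the maximum, divided by n − 1, so that it spans 0 to 1); the function name and the input values are hypothetical.

```python
import numpy as np

def tissue_specificity(r2_by_tissue):
    """Tissue-specificity score for one gene from its per-tissue
    variance explained (age R2 or genetic R2)."""
    r2 = np.asarray(r2_by_tissue, dtype=float)
    n = r2.size
    r2_max = r2.max()
    if n < 2 or r2_max <= 0:
        return float("nan")            # undefined without a nonzero maximum
    return float(np.sum(1.0 - r2 / r2_max) / (n - 1))

s_specific = tissue_specificity([0.3, 0.0, 0.0, 0.0])   # one tissue only
s_ubiquitous = tissue_specificity([0.2, 0.2, 0.2])      # equal everywhere
s_mixed = tissue_specificity([0.4, 0.2, 0.0])           # intermediate
```

Here `s_specific` is 1, `s_ubiquitous` is 0, and `s_mixed` is (0 + 0.5 + 1) / 2 = 0.75.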
Functional constraint analysis
We quantified gene constraint using the probability of loss of function intolerance (pLI) from gnomAD 2.1.1 (20). We analyzed the relationships between pLI vs βage and pLI vs heritability across genes. For these analyses, genes were only included if age or genetics were predictive of gene expression (R2 > 0) for that gene. For genes with R2 > 0, we used linear regression to determine the direction of the relationship between pLI and βage or heritability for each tissue. The F-statistic was used to determine whether pLI was significantly related to these two model outputs. For pLI vs βage, a significant negative slope was considered a Medawarian trend (consistent with Medawar’s hypothesis) and a significant positive slope a non-Medawarian trend. To test whether the non-Medawarian trends were driven by genes with higher expression, we excluded genes in the top quartile of median gene expression and repeated the analysis between pLI and βage (Fig. S39). We also analyzed the evolutionary constraint metric dN/dS (21) and its tissue-specific relationship with βage by determining the slope and significance of the linear regression, as above.
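The per-tissue classification can be sketched as follows. For a regression with a single predictor, the F-test and the two-sided t-test on the slope give the same p-value, so `scipy.stats.linregress` is sufficient here. The significance threshold of 0.05, the function name, and the synthetic values are assumptions; the sign of the slope is the same whichever variable is treated as the predictor.

```python
import numpy as np
from scipy.stats import linregress

def classify_trend(pli, beta_age, alpha=0.05):
    """Label a tissue by the sign of a significant pLI-vs-beta_age slope."""
    fit = linregress(pli, beta_age)
    if fit.pvalue >= alpha:
        return "not significant"
    return "Medawarian" if fit.slope < 0 else "non-Medawarian"

# Synthetic genes: constrained genes (high pLI) expressed earlier or later.
pli = np.linspace(0.0, 1.0, 50)
wiggle = 0.01 * np.sin(np.arange(50))          # small deterministic noise
label_neg = classify_trend(pli, -0.1 * pli + wiggle)
label_pos = classify_trend(pli, 0.1 * pli + wiggle)
```

In a real analysis only genes with R2 > 0 for the relevant term would be passed in, as described above.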
Cancer Somatic Mutation Frequency
We quantified the per-gene and per-tissue cancer somatic mutation frequency using data from the COSMIC cancer browser (24). For each tissue, we selected the closest cancer type as noted in Table S5 and downloaded the number of mutated samples (tumor samples with at least one somatic mutation within the gene) and the total number of samples for all genes. We computed the cancer somatic mutation frequency by dividing the number of mutated samples by the total number of samples. For each tissue, we plotted the gene’s βage vs its cancer somatic mutation frequency for all genes with >200 tumor samples. We report the slope and significance of the relationship between βage and cancer somatic mutation frequency for each tissue. To determine whether age-dependent gene expression heteroskedasticity is related to a gene’s involvement in cancer (Fig. S40), we also plotted each gene’s heteroskedasticity effect size vs the cancer somatic mutation frequency for all genes with >200 tumor samples and moderately significant heteroskedasticity (FDR<0.2). Tissues with ≤5 genes meeting these criteria are not plotted.
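The frequency calculation and sample-size filter are simple enough to show directly. The sketch below uses made-up counts and a hypothetical function name; genes at or below the sample threshold are returned as NaN rather than dropped, which is a design choice of this example and not something the text specifies.

```python
import numpy as np

def mutation_frequency(mutated, total, min_samples=200):
    """Fraction of tumor samples with at least one somatic mutation in each
    gene; genes without more than min_samples samples are set to NaN."""
    mutated = np.asarray(mutated, dtype=float)
    total = np.asarray(total, dtype=float)
    keep = total > min_samples
    freq = np.full(total.shape, np.nan)
    freq[keep] = mutated[keep] / total[keep]
    return freq

# Three hypothetical genes: the second has too few tumor samples.
freq = mutation_frequency([10, 5, 50], [1000, 200, 500])
```

The resulting frequencies can then be regressed against each gene's βage within a tissue to obtain the slope and its significance.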
Non-Medawarian tissue analysis
To explore the non-Medawarian trend in some tissues, we assessed the distribution of βage across Medawarian and non-Medawarian tissues for genes within each of the 50 MSigDB hallmark pathways (22). Significant differences between the distributions were called using a t-test, and p-values were adjusted for multiple hypothesis testing using a Benjamini-Hochberg correction.
Code and data availability
All analyses were performed in R version 4.0.2 and Python 3.6. All code is available online at https://github.com/sudmantlab/gene_expression_aging and archived at https://doi.org/10.5281/zenodo.6534137. Full results for joint age and genetic model can be found on Zenodo https://doi.org/10.5281/zenodo.6533954.
AUTHOR CONTRIBUTIONS
RY, RC, JMV, HS, PS, and PHS performed all analysis. RY, RC, NMI, and PHS wrote the manuscript. PHS and NMI supervised the project. PHS conceived of the project.
ACKNOWLEDGEMENTS
This work was supported by the National Institute of General Medical Sciences grant R35GM142916 to P.H.S. and the National Human Genome Research Institute grant R00HG009677 to N.M.I.