## Abstract

Recent associations between Major Depressive Disorder (MDD) and measures of premature aging suggest accelerated biological aging as a potential biomarker for MDD susceptibility or MDD as a risk factor for age-related diseases. Statistical and machine learning regression models of biological age have been trained on various sources of high dimensional data to predict chronological age. Residuals or “gaps” between the predicted biological age and chronological age have been used for statistical inference, such as testing whether an increased age gap is associated with a given disease state. Recently, a gene expression-based model of biological age showed a higher age gap for individuals with MDD compared to healthy controls (HC). In the current study, we propose a machine learning approach that simplifies gene selection by using a least absolute shrinkage and selection operator (LASSO) penalty to construct an expression-based Gene Age Gap Estimate (GAGE) model. We construct the LASSO-GAGE (L-GAGE) model in an RNA-Seq study of 78 unmedicated individuals with MDD and 79 HC and then test for accelerated biological aging in MDD. When testing L-GAGE association with MDD, we account for factors such as sex and chronological age to mitigate regression-to-the-mean effects. The L-GAGE shows higher biological aging in MDD subjects than HC, but the elevation is not statistically significant. However, when we dichotomize chronological age, the interaction between MDD status and age is significant in the L-GAGE model. This effect remains statistically significant even after adjusting for chronological age and sex. We find that cytomegalovirus (CMV) serostatus is associated with elevated L-GAGE. We also investigate the feature selection methods Random Forest and nearest-neighbor projected distance regression (NPDR) to characterize age-related genes, and we find functional enrichment of infectious disease and SARS-CoV pathways.

## 1. Introduction

Major depressive disorder (MDD) has been hypothesized to show characteristics of premature aging [1]. Biological aging can be measured in multiple dimensions such as telomere length, immunosenescence, brain volume, and gene expression. These measures of biological aging are correlated with chronological age, but environmental and genetic factors can increase or decrease an individual’s biological age relative to their chronological age and influence their risk for age-related diseases. For example, MDD has been associated with markers of cellular and immune aging including shortened leukocyte telomere length [2, 3], elevated indicators of oxidative stress [4], and elevated circulating inflammatory cytokines [5]. Epigenetic clocks predicting biological age based on the accumulation of methylated CpG sites have found higher biological age in MDD subjects compared with healthy controls [6]. Brain age models constructed from T1-weighted magnetic resonance imaging (MRI) data from 2,188 healthy controls predicted a gap of +1.08 years (SE 0.22) between predicted and chronological age across 2,675 depressed subjects [7].

A recent RNA-Seq MDD study from Cole et al. found that gene expression-based biological aging was elevated in MDD subjects compared to HC [8]. The PBMC samples included four groups: 44 healthy controls, 94 MDD treatment-resistant, 47 MDD treatment-responsive and 46 MDD untreated [8]. They selected age genes iteratively by varying the P-value threshold for the t-test between upper and lower chronological age quartiles. For a given iteration, a biological age was computed for each subject based on the signed z-score of the age-related genes, and the P-value threshold was chosen to optimize the correlation between biological and chronological age of the subjects (Spearman Correlation Coefficient (SCC) = 0.72, p < 0.01). A linear model of biological age was fit to chronological age, and association with MDD was computed by comparing the number of MDD and HC subjects above and below the regression line.

In the current study, we create a biological age model from RNA-Seq gene expression using a multivariate LASSO penalized regression rather than an iterative univariate test. We use age as a quantitative variable during the feature selection in linear regression, as opposed to using age quartiles as in Ref. [8], which allows our model to include more variation when estimating the age model. When later using chronological age as a covariate for MDD association, we dichotomize chronological age. LASSO allows automatic feature selection of a multivariate linear regression model based on the cross-validated penalty hyperparameter optimization. We train the LASSO biological age model using an existing RNA-Seq dataset consisting of 157 individuals (78 with MDD and 79 healthy controls) [8, 9], and we use the residual of the LASSO model as an estimate of the gap between an individual’s chronological age and their biological gene age. A positive gap indicates higher than average biological age or elevated aging compared to chronological age. This LASSO Gene Age Gap Estimate (L-GAGE) shows elevated biological aging in MDD subjects compared to HC, but the elevation is not statistically significant. However, when we dichotomize chronological age into older and younger, the interaction between MDD status and age is significant in the L-GAGE model. Finally, we use machine learning feature selection to explore biological pathways that are significantly enriched for the gene sets identified as being associated with aging.

## 2. Materials and Methods

### 2.1. RNA-Seq Data

To test our biological age models, we use an extant RNA-Seq dataset [10]. The study was approved by the Western Institutional Review Board and conducted according to the principles expressed in the Declaration of Helsinki. The data consist of 78 MDD and 79 HC subjects (91 females and 66 males). Individuals with current symptoms of depression met DSM-IV-TR criteria for MDD based on the Structured Clinical Interview for DSM-IV-TR Axis I Disorders and an unstructured psychiatric interview. HC individuals had no personal or immediate family history of major psychiatric disorders. MDD participants were unmedicated for at least 3 weeks prior to study entry. Exclusion criteria included major medical or neurological illness, psychosis, traumatic brain injury, and a history of drug/alcohol abuse within 1 year. There is a higher female/male ratio for MDD (51/27) than HC (40/39), compatible with trends in the general population. The age distribution is slightly skewed towards younger individuals, with ages ranging from 18 to 55 (Fig. 1). The 8,923 genes in the RNA-Seq gene expression data are normalized by counts per million reads, which we then quantile normalize and log2 transform to stabilize variance. We remove genes with a low coefficient of variation (standard deviation divided by absolute mean), using a threshold of 0.045 to obtain 5,587 genes.
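The coefficient-of-variation filter described above can be sketched in a few lines. This is a minimal illustration in Python rather than the authors' code: the gene names and expression values are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Standard deviation divided by the absolute mean, as defined in the text.
    Uses the sample standard deviation (an assumption)."""
    m = mean(values)
    if m == 0:
        return float("inf")  # CV is undefined; treated as highly variable here (an assumption)
    return stdev(values) / abs(m)

def filter_low_cv(expr, threshold=0.045):
    """Keep genes whose CV across subjects exceeds the threshold."""
    return {gene: vals for gene, vals in expr.items()
            if coefficient_of_variation(vals) > threshold}

# Hypothetical log2 expression values for three made-up genes across four subjects.
expr = {
    "GENE_A": [5.0, 5.1, 4.9, 5.0],  # nearly flat across subjects
    "GENE_B": [5.0, 6.0, 4.0, 5.5],  # clearly variable
    "GENE_C": [8.0, 8.0, 8.0, 8.0],  # constant
}
kept = filter_low_cv(expr)
print(sorted(kept))
```

Only the variable gene survives the filter in this toy input; nearly constant genes carry little information about age and are dropped before model fitting.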

### 2.2. Gene Age Gap Estimate (GAGE) using RNA-Seq

We use LASSO for gene selection and modeling biological age, and then we use the residual of this model, which we call LASSO Gene-Age Gap Estimate (L-GAGE), for association testing with MDD. For the LASSO biological aging model, we build a full penalized regression model with all gene expression variables and with chronological age as the outcome variable. We include both MDD and HC samples in the age model, which was also the approach in Ref. [8]. Our biological age model is based on the non-zero coefficient genes from the lambda-1se LASSO penalty (the largest *λ* for which the average cross-validation (CV) error is within one standard error of the minimum CV error). We compute the gap/residuals of the LASSO model between predicted biological age and chronological age (i.e., the L-GAGE score). Our goal is to use L-GAGE to test for increased biological age in MDD subjects (Fig. 2).
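The lambda-1se rule and the gap computation can be made concrete with a small sketch. It assumes that a cross-validation curve (mean error and standard error per candidate *λ*) has already been produced by a LASSO fitting routine; the numbers below are hypothetical and are not values from this study. The sign convention (predicted minus chronological age) follows the statement that a positive gap indicates elevated biological age.

```python
def lambda_1se(lambdas, cv_mean, cv_se):
    """Return the largest lambda whose mean CV error is within one
    standard error of the minimum mean CV error."""
    i_min = min(range(len(cv_mean)), key=cv_mean.__getitem__)
    cutoff = cv_mean[i_min] + cv_se[i_min]
    return max(lam for lam, err in zip(lambdas, cv_mean) if err <= cutoff)

def age_gap(predicted_age, chronological_age):
    """Gap = predicted biological age minus chronological age."""
    return [p - c for p, c in zip(predicted_age, chronological_age)]

# Hypothetical cross-validation curve (not values from the study).
lambdas = [0.1, 0.5, 1.0, 2.0, 4.0]
cv_mean = [60.0, 52.0, 50.0, 53.0, 70.0]
cv_se = [4.0, 4.0, 4.0, 4.0, 5.0]

best = lambda_1se(lambdas, cv_mean, cv_se)
gaps = age_gap([31.5, 44.0, 27.0], [30.0, 47.0, 27.0])
print(best, gaps)
```

Choosing the larger penalty trades a small increase in cross-validated error for a sparser gene set, which is why the lambda-1se model retains fewer genes than the minimum-error model would.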

### 2.3. Relationship between gene age gap, chronological age, MDD and sex

It is important to consider adjustments for chronological age in biological age models because of regression to the mean as discussed for brain age models [11], but sex is also an important covariate for MDD. To further explore covariate effects, we add MDD × Age and MDD × Sex interactions for L-GAGE associations with MDD. We use the OLS model

$$\text{L-GAGE} = \beta_0 + \beta_1\,\text{MDD} + \beta_2\,Z + \beta_3\,(\text{MDD} \times Z) + \varepsilon \qquad (1)$$

where Z represents the adjustment or interaction variable (Age or Sex). We focus on the effect of $\beta_3$, which represents how much the average L-GAGE of the MDD group changes for the Z=1 condition.
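When both MDD and Z are coded 0/1, the OLS model with an MDD term, a Z term, and their product is saturated, so its coefficients are simple differences of the four group means. The sketch below illustrates this interpretation of the interaction coefficient with hypothetical gap values; it is not the authors' fitting code, and the helper name is invented for illustration.

```python
from statistics import mean

def interaction_from_cell_means(rows):
    """rows: list of (gap, mdd, z) tuples with mdd and z coded 0/1.
    Returns (b0, b1, b2, b3) for gap ~ b0 + b1*mdd + b2*z + b3*mdd*z.
    With two binary predictors the OLS solution equals these
    differences of cell means."""
    cell = {(m, z): mean(g for g, mm, zz in rows if (mm, zz) == (m, z))
            for m in (0, 1) for z in (0, 1)}
    b0 = cell[0, 0]                      # HC, Z = 0
    b1 = cell[1, 0] - cell[0, 0]         # MDD effect when Z = 0
    b2 = cell[0, 1] - cell[0, 0]         # Z effect in HC
    b3 = (cell[1, 1] - cell[0, 1]) - b1  # change in the MDD effect when Z = 1
    return b0, b1, b2, b3

# Hypothetical gap values: (gap, MDD, Z), e.g. Z = 1 for the older age group.
rows = [
    (-1.0, 0, 0), (1.0, 0, 0),   # HC, Z = 0
    (0.0, 1, 0), (1.0, 1, 0),    # MDD, Z = 0
    (-3.0, 0, 1), (-1.0, 0, 1),  # HC, Z = 1
    (1.0, 1, 1), (3.0, 1, 1),    # MDD, Z = 1
]
b0, b1, b2, b3 = interaction_from_cell_means(rows)
print(b0, b1, b2, b3)
```

In this toy input the MDD versus HC difference is 0.5 when Z = 0 and 4.0 when Z = 1, so the interaction coefficient is 3.5. A continuous Z would require a general least-squares solver instead of cell means.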

We consider two cases when age is used as a covariate with interactions (Z in Eq. 1): as continuous and as dichotomous with a threshold. To verify our choice of age threshold, we use a threshold regression model in the “chngpt” package in R [12]. We use this approach to check for a possible nonlinear relationship between MDD and age and whether the effect of chronological age on MDD increases at some threshold point. The mean function of the threshold model is

$$\eta = \alpha^{\top} z + \beta\, I(x > e) \qquad (2)$$

where η is the linear predictor, x stands for chronological age, e is the age threshold, z are additional predictors, and α and β are regression coefficients. “I” is a step indicator function. The threshold is optimized using the exact criterion function with a logistic-based smooth function.
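The idea of estimating a change point can be illustrated with a deliberately simplified grid search. This sketch is a stand-in and not the chngpt estimator: it fits a step function to a continuous outcome by least squares, omits the additional predictors z, and uses hypothetical data.

```python
def fit_step_threshold(x, y):
    """Grid-search the threshold e in the step model y ~ a + b * I(x > e)
    by minimizing the residual sum of squares over observed x values."""
    best = None
    candidates = sorted(set(x))[:-1]  # the largest x would leave the upper group empty
    for e in candidates:
        lower = [yi for xi, yi in zip(x, y) if xi <= e]
        upper = [yi for xi, yi in zip(x, y) if xi > e]
        m_lower = sum(lower) / len(lower)
        m_upper = sum(upper) / len(upper)
        rss = (sum((v - m_lower) ** 2 for v in lower)
               + sum((v - m_upper) ** 2 for v in upper))
        if best is None or rss < best[0]:
            best = (rss, e, m_lower, m_upper - m_lower)
    _, e, a, b = best
    return e, a, b

# Hypothetical ages and outcomes with a jump between ages 38 and 42.
ages = [20, 25, 30, 35, 38, 42, 45, 50, 55]
outcome = [0.1, -0.2, 0.0, 0.2, -0.1, 2.0, 2.2, 1.9, 2.1]
e, a, b = fit_step_threshold(ages, outcome)
print(e, round(a, 3), round(b, 3))
```

Because candidate thresholds are restricted to observed ages, any cutoff between two adjacent ages fits equally well; a smooth approximation to the indicator, as mentioned in the text, is one way to make the criterion easier to optimize.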

### 2.4. Feature selection, Gene-Age pathway Enrichment, and interpretable classifier

We use LASSO to create the gene-based residual age model, L-GAGE, but LASSO feature selection also results in a set of age-related genes. As a secondary analysis, we use LASSO and other feature selection methods to identify important age-related genes for pathway enrichment to understand the biological mechanisms of the age models. We use univariate linear regression, random forest (RF) regression, and nearest-neighbor projected distance regression (NPDR) [10] as feature selection methods. RF has the ability to find more complex models than LASSO and linear regression, but RF has limited ability to detect interactions [13], whereas NPDR has the ability to detect interaction effects [10]. For univariate feature selection, we use a linear model of individual genes with age, and we use a P-value threshold of 0.05 (uncorrected for improved pathway overlap). We use the standard NPDR with an adjusted P-value threshold of 0.05 FDR, and we use the LASSO penalized NPDR. For NPDR, we use the imbalanced k-nearest-neighbor value (k=47) that approximates the 0.5 standard deviation of the hyper-radius [10]. We use permutation variable importance with RF. We use the Reactome Pathway database in MSigDB [14, 15] for biological pathway enrichment of age related genes. For additional interpretation of the gene-age prediction of MDD along with consideration for other covariates, we train a decision tree to predict MDD based on L-GAGE, chronological age, and sex. Decision trees have high variance, but they are useful for interpreting the relationships between covariates.
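The difference between an uncorrected P-value threshold and an FDR-adjusted threshold can be seen in a short example. The sketch below assumes the Benjamini-Hochberg procedure for the FDR adjustment, which the text does not name explicitly, and uses hypothetical gene names and P-values.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical per-gene p-values from a univariate test against age.
pvals = {"GENE_A": 0.001, "GENE_B": 0.04, "GENE_C": 0.03, "GENE_D": 0.20}

uncorrected = sorted(g for g, p in pvals.items() if p < 0.05)
adj = dict(zip(pvals, benjamini_hochberg(list(pvals.values()))))
selected = sorted(g for g, q in adj.items() if q < 0.05)
print(uncorrected, selected)
```

Here three genes pass the uncorrected 0.05 threshold but only one passes after adjustment, which illustrates why the uncorrected threshold yields a larger gene set for pathway overlap.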

## 3. Results and Discussion

### 3.1. Testing Association of Gene Age L-GAGE with MDD

We test for association of the LASSO Gene Age Gap Estimate (L-GAGE) score with MDD status. L-GAGE is the residual from a LASSO gene expression model of chronological age. The LASSO model uses the cross-validation tuned lambda-1se value (*λ* = 1.636048), which is the largest *λ* at which the mean-squared error (MSE) is within one standard error of the minimum MSE. The residual variance is constant; that is, heteroscedasticity is not present based on the Non-constant Variance Score Test. The penalty results in a multivariate linear model of age with 22 genes and a Spearman Correlation Coefficient (SCC) with chronological age of 0.77 (Fig. 2). Counting the number of HC or MDD subjects above or below the regression line (Fig. 2), we find that the biological age is greater in MDD subjects than HC (HC: 45 (56.96%) below, 34 (43.04%) above; MDD: 35 (44.87%) below, 43 (55.13%) above). The Chi-squared test of the L-GAGE sign (above or below the line) versus MDD status is not significant (P-value = 0.1753). The greater L-GAGE in MDD versus HC can be seen in the L-GAGE density (Fig. 3A). The L-GAGE distributions for males and females are very similar (Fig. 3B). While L-GAGE is greater in MDD than HC subjects, we do not find a statistically significant replication of the effect found in Ref. [8]. However, we do see a suggestive difference with an effect size similar to what they found. Using the same genes as in their model also does not replicate the effect.
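The sign test above is a Chi-squared test on a 2×2 table of counts, which can be checked by hand. The sketch below uses the counts reported in the text and assumes the Yates continuity correction (the default for 2×2 tables in R's `chisq.test`); whether the authors applied the correction is an assumption.

```python
import math

def chi2_2x2_yates(a, b, c, d):
    """Pearson chi-squared test with Yates continuity correction for the
    2x2 table [[a, b], [c, d]]. Returns (statistic, p_value) with 1 df."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            diff = max(abs(observed[i][j] - expected) - 0.5, 0.0)
            stat += diff ** 2 / expected
    # Upper tail of chi-squared with 1 df: P(X > s) = erfc(sqrt(s / 2)).
    return stat, math.erfc(math.sqrt(stat / 2.0))

# Rows: HC, MDD. Columns: below, above the regression line (counts from the text).
stat, p = chi2_2x2_yates(45, 34, 35, 43)
print(round(stat, 3), round(p, 4))
```

Under this assumption the result should be close to the reported P-value of 0.1753.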

### 3.2 Testing MDD-Age interaction for L-GAGE association model

We test for the effect of L-GAGE on MDD by introducing an MDD-Age interaction term (Eq. 1). Dichotomizing age at threshold 40, MDD alone is not significant, but we find a statistically significant effect of the interaction between MDD and Age 40 on L-GAGE (Table 1, Fig. 4). For individuals younger than 40, L-GAGE shows very little difference between MDD and HC, but for older individuals, there is greater biological aging (L-GAGE) for the MDD versus HC group (Fig. 4 and Table 1). Age alone is also statistically significant (Table 1). These age effects remain significant when we add sex as a covariate (Table 1B), but sex is not significant (Table 1B and Table 2).

The MDD-Age interaction and the MDD term (Eq. 1) do not have a significant effect on L-GAGE when age is treated as a continuous variable (MDD P-value = 0.364, Age P-value = 0.316, MDD × Age P-value = 0.197). Also, there is no direct statistical association between MDD and age or between MDD and sex (two-sample t-test of MDD and chronological age: P-value = 0.167; Chi-squared test of MDD and sex: P-value = 0.08716). To further support our choice of age threshold, we use a threshold regression (Eq. 2). The change point for age in relation to MDD is estimated to be 39 years (Fig. 5). Combined with the third quartile being age 41, the threshold regression suggests that age 40 is a suitable cutoff point for dividing the subjects into two age groups.

Additional support for the age-40 threshold can be seen in the decision tree for predicting MDD (Fig. 6), where age with threshold 39.5 is the second most important split variable, following L-GAGE. The decision tree also suggests interaction effects, where the effect of L-GAGE on MDD is conditioned on chronological age. If L-GAGE (top node) is below a threshold, subjects tend to be HC. If the L-GAGE is below the threshold and chronological age is above 39.5 (i.e., an interaction), subjects tend to be MDD. However, for chronological age less than 39.5, the prediction of MDD is more complex (Fig. 6). We note that this decision tree was trained on the full dataset to maximize power, but it is instructive for interpretation.

A subset of our subjects (136 out of 157) have anti-CMV (human cytomegalovirus) IgG antibody data. Of the 136 samples, 70 are CMV seropositive and 66 are CMV seronegative. Although the P-value is not significant (0.097), we find that the mean biological age gap (L-GAGE) is higher in CMV positive subjects compared to CMV negative subjects (Fig. 7A). For the subset of subjects with both CMV data and MDD status data, there are 75 HC and 61 MDD and 83 female and 54 male. While CMV positive subjects tend to have an elevated biological age, the effect is not MDD or sex specific (Fig. 7B and 7C). In other words, being CMV positive elevates gene age regardless of MDD/HC status or sex.

### 3.3 Characterizing Age-Associated Genes

The LASSO regression used in L-GAGE selected 22 age genes with non-zero coefficients (Table 3). We broaden the characterization of age-related genes in our MDD data through pathway enrichment from statistical and machine learning feature selection methods: linear regression, RF, and nearest-neighbor projected distance regression (NPDR) [10]. Across all feature selection methods, the four common age genes are NAA20 (N-alpha-acetyltransferase 20), CCNE1 (Cyclin E1), SETD1A (SET domain containing protein 1A), and TAF9 (TATA-box-binding protein associated factor 9). Using the feature selection gene sets and the Reactome database, we find enrichment for Infectious Disease, Adaptive Immune System, and SARS-CoV-2 Infection pathways (Tables 5 and 6). SARS-CoV-2 can cause neurological complications, and a recent study showed that differentially expressed genes for COVID infection overlap with many gene associations for neuropsychiatric disorders including depression [16].
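Pathway enrichment of a selected gene set is commonly assessed with a hypergeometric (one-sided Fisher) overlap test. The sketch below shows that calculation under the assumption that such a test is used; the text does not state the exact enrichment statistic. All counts are hypothetical except the 5,587-gene background from Section 2.1.

```python
from math import comb

def hypergeom_enrichment_p(universe, pathway, selected, overlap):
    """P(X >= overlap) when `selected` genes are drawn without replacement
    from `universe` genes, of which `pathway` belong to the pathway."""
    total = comb(universe, selected)
    upper = min(pathway, selected)
    tail = sum(comb(pathway, k) * comb(universe - pathway, selected - k)
               for k in range(overlap, upper + 1))
    return tail / total

# Hypothetical pathway size, selected set size, and overlap.
p_enrich = hypergeom_enrichment_p(universe=5587, pathway=200, selected=100, overlap=12)
print(p_enrich)
```

With these toy numbers about 3.6 overlapping genes are expected by chance, so an overlap of 12 yields a small P-value. In practice such P-values would also be adjusted for the number of pathways tested.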

## Conclusion

We presented a procedure for creating an expression-based biological age model using LASSO penalized regression, and we explored the association of the residual, or the LASSO-based Gene Age Gap Estimate (L-GAGE), with MDD while adjusting for chronological age and sex. We found increased biological aging based on L-GAGE in MDD versus HC subjects with an effect size similar to a previous study [8], but the difference was not statistically significant. Larger sample sizes are needed to further test this effect. We found a statistically significant MDD-Age interaction for L-GAGE when age is dichotomized with a threshold of 40 years. We used multiple statistical criteria for choosing this threshold. This finding could indicate an effect of lifetime number of MDD episodes on biological aging that is not detectable until middle age. The interaction effect remained significant when adjusting for chronological age and sex, and we reiterate the importance of including age in L-GAGE association tests to avoid confounding due to regression to the mean [11].

We explored the top age-associated genes with different feature selection methods, and we identified a consensus set of genes, CCNE1, NAA20, SETD1A, and TAF9, that have been associated with aging, senescence, and infectious disease. In a study of lung adenocarcinoma, CCNE1 gene expression was found to be correlated with patients’ age [17], and NAA20 and SETD1A are involved in senescence, which is related to aging and age-related diseases. It was shown that depletion of NAA20 in non-transformed mammalian cells led to senescence [18], and in another study knockdown of SETD1A triggered cellular senescence [19]. TAF9 cross-reactivity was shown to be associated with immunity to CMV in the context of autoimmune disease [20]. Recall that we found CMV positive status to be associated with elevated biological age based on L-GAGE. Pathway enrichment of the broader set of age genes selected by linear regression, random forest, and NPDR resulted in the detection of Infectious Disease, Adaptive Immunity, and SARS-CoV Infection pathways. As noted in Ref. [8], evaluating PBMC transcription can increase the risk for false positive immune pathways.

This study contributes a new approach to estimating biological aging and adds to the evidence for the role of aging and inflammation in depression. Future studies are needed with broader age ranges, more uniform age distributions, larger sample sizes, and utilization of MDD age-of-onset and number of depressive episodes. Future gene age models may help identify individuals who need different treatment or management for depression due to an increase in their relative biological age.

## Research data for this article

Data and code for this research will be available at https://github.com/insilico/GeneAgeMDD.

## Funding

BAM and JS received support from the National Institute of Mental Health (R01MH098099).