Predicting Alzheimer’s Cognitive Resilience Score: A Comparative Study of Machine Learning Models Using RNA-seq Data

Abstract
While amyloid plaques and neurofibrillary tangles are hallmark pathological features of Alzheimer's disease (AD), cognitive resilience (CR) is a phenomenon in which cognitive function remains preserved despite the presence of these pathological features. This study aimed to construct and compare predictive machine learning models for CR scores using RNA-seq data from the Religious Orders Study and Memory and Aging Project (ROSMAP) and Mount Sinai Brain Bank (MSBB) cohorts. We evaluated support vector regression (SVR), random forest, XGBoost, linear, and transformer-based models. The SVR model exhibited the best performance, and its contributing genes were identified using Shapley additive explanations (SHAP) scores, providing insights into biological pathways associated with CR. Finally, we developed a tool called the resilience gene analyzer (REGA), which visualizes SHAP scores to interpret the contributions of individual genes to CR. REGA is available at https://igcore.cloud/GerOmics/REsilienceGeneAnalyzer/.


Introduction
Alzheimer's disease (AD) remains a critical area of research because of its largely unknown causes and pathophysiology, as well as the urgent need for effective treatments.
In AD, cognitive resilience (CR) refers to the ability of some individuals to maintain cognitive function despite the presence of amyloid plaques and neurofibrillary tangles, the hallmark pathological features of the disease. CR is defined as the discrepancy between an individual's observed and expected cognitive functions based on the extent of brain pathology [1][2][3]. Several factors have been associated with CR, including sex/gender differences [4], educational level, and brain weight [5,6]. Studies have also linked other factors, such as personality traits, Parkinson's disease, depression, life activities, and eudaimonic well-being, to CR [7][8][9]. Despite these findings, the molecular mechanisms underlying CR remain unclear.
Commonly used statistical methods in omics studies, however, frequently fail to detect significant factors or to provide detailed mechanistic insights into CR. Machine learning algorithms, by contrast, have been successfully applied to large and complex datasets such as RNA-seq data [31,32].
These algorithms have been successful in identifying disease-related gene signatures, predicting prognosis, stratifying patients, and facilitating the classification of disease subtypes [33][34][35][36][37][38]. For example, Ahammad et al. developed an AD prediction tool called AITeQ, which utilizes machine learning models and RNA-seq data to identify a set of genes with high accuracy [38]. While machine learning has been successfully applied to AD prediction, few studies have focused on utilizing these techniques to analyze omics data, particularly for CR.
In this study, we aimed to analyze the pathology of CR in AD by constructing regression models. These models predict CR scores using RNA-seq data obtained from the ROSMAP [39] and MSBB [40] cohorts. For both datasets, the support vector regression (SVR) model exhibited the best performance. In addition, we used Shapley additive explanations (SHAP) [41] scores to identify genes contributing to the model's predictions, revealing contributions from genes associated with energy metabolism pathways. Finally, we developed the resilience gene analyzer (REGA), which visualizes the SHAP scores calculated by the best-performing SVR model. This study lays the foundation for applying machine learning to cognitive resilience research in AD and paves the way for experimental validation of the findings.

Materials and methods
Fig. 1 shows the analytical workflow used in this study. The detailed flow from data collection and pre-processing to training is shown in Supplementary Fig. S1.

Data Collection and Preprocessing
To calculate the CR score, we used RNA-seq data from the MSBB and ROSMAP cohorts, which included both cognitive function and pathological assessments related to amyloid-β and tau. We used conditional quantile-normalized data from these cohorts (MSBB: https://doi.org/10.7303/syn2580853; ROSMAP: https://doi.org/10.7303/syn2580853). The details of sample collection and data processing have been described previously [39,40].
Outliers identified using principal component analysis were excluded. In the ROSMAP cohort, the tissues included the dorsolateral prefrontal cortex, head of the caudate nucleus, and posterior cingulate cortex. The head of the caudate nucleus and sequencing batch number 9 were excluded from the analysis. The MSBB cohort included tissue from the frontal pole, inferior frontal gyrus, parahippocampal gyrus, prefrontal cortex, and superior temporal gyrus. Finally, we utilized 1247 samples from the MSBB cohort and 1549 samples from the ROSMAP cohort, including patients with dementia and healthy controls. Batch effects were corrected using ComBat [42] based on the batch information of each dataset.

Quantification of CR
A linear regression analysis was performed to calculate CR scores. The score for each individual was defined as the difference between observed and predicted cognitive functions.
To predict cognitive function, the Clinical Dementia Rating (CDR) was used as the target variable for the MSBB cohort, and the Mini-Mental State Examination (MMSE) was used for the ROSMAP cohort. The explanatory variables included the CERAD score and Braak stage.
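The residual-based CR score described above can be sketched as follows. This is a minimal illustration with synthetic data standing in for the pathology measures (CERAD score, Braak stage) and cognition scores; the coefficients and noise level are invented for the example, not taken from the cohorts.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
# Hypothetical pathology measures: CERAD score (1-4) and Braak stage (0-6)
X = np.column_stack([rng.integers(1, 5, n), rng.integers(0, 7, n)])
# Hypothetical cognition score (e.g. MMSE-like), declining with pathology plus noise
cognition = 30 - 1.5 * X[:, 0] - 1.2 * X[:, 1] + rng.normal(0, 2, n)

# Expected cognition given pathology, via ordinary least squares
model = LinearRegression().fit(X, cognition)

# CR score = observed - expected cognitive function (the regression residual)
cr_score = cognition - model.predict(X)
```

Positive residuals indicate individuals who perform better cognitively than their pathology burden would predict, i.e. higher resilience.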

Feature Selection
In this study, the features most strongly correlated (by Pearson correlation coefficient) with CR scores were selected for analysis, using feature set sizes of 1000, 2000, 3000, 4000, and 5000. For example, for the 1000-feature set, the top 500 positively correlated and the top 500 negatively correlated features were selected. The features were normalized using the MinMaxScaler function from the scikit-learn library in Python.

Machine Learning Models

Linear Regression Model
Linear regression is a basic statistical method that is used to model the relationship between a dependent variable and one or more independent variables. In this study, we implemented a linear regression model using the PyTorch library in Python. The model was trained using the Adam optimizer and the mean squared error (MSE) loss function. The hyperparameter settings included a learning rate of 0.01, 300 epochs, and a batch size of 16.
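A minimal PyTorch sketch of this setup, using the stated hyperparameters (Adam, lr 0.01, 300 epochs, batch size 16). The toy data, feature count, and target construction are ours, not the cohort data; the model is simply a single `nn.Linear` layer.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
X = torch.randn(128, 1000)                      # toy stand-in for normalized gene features
y = X[:, :3].sum(dim=1, keepdim=True) + 0.1 * torch.randn(128, 1)

model = nn.Linear(1000, 1)                      # linear regression as a single layer
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
loader = DataLoader(TensorDataset(X, y), batch_size=16, shuffle=True)

init_loss = loss_fn(model(X), y).item()
for epoch in range(300):                        # 300 epochs, as in the paper
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()
final_loss = loss_fn(model(X), y).item()
```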

Support Vector Regression
Support vector regression (SVR) [43] is a type of SVM used for regression tasks.
SVR aims to determine a function that approximates the target values with minimal deviation by fitting a hyperplane within a margin of tolerance to the data points. SVR is robust against outliers and can model nonlinear relationships using kernel functions, making it particularly useful for high-dimensional, complex data. The SVR function from the scikit-learn library was used.
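A short scikit-learn sketch of SVR on a nonlinear toy target. The kernel choice and `C`/`epsilon` values here are illustrative defaults, not the tuned hyperparameters from the study (those come from the grid search described later).

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
# Nonlinear target: sine of one feature plus a linear term, with small noise
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.1, 200)

# The RBF kernel lets SVR capture the nonlinearity; C and epsilon are illustrative
svr = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X[:150], y[:150])
r2 = svr.score(X[150:], y[150:])                # R^2 on the held-out 50 samples
```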

Random Forest
Random forest [44] is an ensemble learning method that constructs multiple decision trees during training and outputs the average predictions of the individual trees. This approach improves the prediction accuracy and controls overfitting by averaging the results of multiple trees. Each tree was trained on a random subset of the data, introducing diversity and reducing variance. Random forests can handle many features and capture intricate data patterns. In this study, we used the RandomForestRegressor function from the scikit-learn library.
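A brief RandomForestRegressor sketch on synthetic data standing in for gene-expression features. The tree count is an arbitrary illustrative value; `feature_importances_` is shown because tree ensembles expose it naturally, though the study itself uses SHAP for attribution.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data: 100 features, 10 of them informative
X, y = make_regression(n_samples=300, n_features=100, n_informative=10,
                       noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
r2 = rf.score(X_te, y_te)
# Indices of the 10 most influential features, by impurity-based importance
top_features = rf.feature_importances_.argsort()[::-1][:10]
```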

Extreme Gradient Boosting
Extreme gradient boosting (XGBoost) [45] is an advanced implementation of gradient boosting designed for speed and performance. It builds an ensemble of trees sequentially, with each new tree correcting the errors of its predecessors. XGBoost uses regularization techniques to prevent overfitting and incorporates various optimization algorithms to improve training efficiency. We used the XGBRegressor from the XGBoost library.

Transformer-based Model
Initially developed for natural language processing tasks [46], transformer-based models have shown great promise in various domains, including RNA-seq analysis [14,47]. These models utilize self-attention mechanisms to weigh the importance of different input features, making them highly effective for modeling complex relationships. In this study, we used a recently reported transformer-based model [47] with some modifications to predict CR by leveraging its ability to learn intricate patterns from high-dimensional RNA-seq data. The architecture of our transformer-based model includes two attention layers, each with four multihead attention mechanisms. We used the GELU activation function and included dropout layers at a rate of 0.4 to prevent overfitting. The feedforward network within the transformer model has a hidden layer dimensionality of 256. The model was optimized using the Adam optimizer with a learning rate of 0.0001. Additionally, the model was trained for 500 epochs with a batch size of 16.
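The stated architecture (two encoder layers, four attention heads, GELU, dropout 0.4, feedforward width 256) can be sketched in PyTorch as below. How the referenced model tokenizes a gene-expression vector is not specified here, so the chunk-into-tokens scheme, the embedding width `d_model=64`, and the class name `CRTransformer` are our own assumptions for a runnable sketch.

```python
import torch
from torch import nn

class CRTransformer(nn.Module):
    """Sketch of a transformer regressor for CR (tokenization scheme assumed)."""
    def __init__(self, n_genes: int, n_tokens: int = 50, d_model: int = 64):
        super().__init__()
        assert n_genes % n_tokens == 0
        self.n_tokens = n_tokens
        # Chunk the gene vector into tokens and embed each chunk
        self.embed = nn.Linear(n_genes // n_tokens, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, dim_feedforward=256,
            dropout=0.4, activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)

    def forward(self, x):                          # x: (batch, n_genes)
        tokens = x.view(x.size(0), self.n_tokens, -1)
        h = self.encoder(self.embed(tokens))       # self-attention over chunks
        return self.head(h.mean(dim=1)).squeeze(-1)  # pool tokens, regress CR

model = CRTransformer(n_genes=1000)
out = model(torch.randn(8, 1000))                  # one CR prediction per sample
```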

K-fold Cross Validation and Hyperparameter Tuning
To ensure the robustness and generalizability of the predictive models, k-fold cross-validation and hyperparameter tuning were implemented. We utilized the scikit-learn library to perform grid-search cross-validation and systematically evaluated different hyperparameter combinations to identify the optimal configuration for each model. We employed a nested cross-validation approach with an outer loop of 5-fold cross-validation to assess generalization performance. An inner loop of 5-fold cross-validation was used within the training set to perform a grid search for hyperparameter tuning. For each fold in the outer cross-validation, the data were split into training and test sets. Within the training set, grid-search cross-validation was performed using the defined hyperparameter grid. The best hyperparameters were selected based on the lowest MSE from the inner loop. The best model was retrained using the entire training set and evaluated using the test set.
Performance metrics, including R² and the root mean squared error (RMSE), were calculated for both the training and test sets. The average performance metrics across all folds were computed to provide an overall assessment of model performance. This systematic approach allowed us to thoroughly evaluate and optimize our models, ensuring reliable and accurate predictions of CR. Detailed results, including the hyperparameters and performance metrics for each fold, are provided in Supplementary Tables S2 and S4.
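The nested cross-validation procedure above can be sketched with scikit-learn. The parameter grid and synthetic data are illustrative only; the study's actual grids are in the supplementary tables. The inner `GridSearchCV` selects hyperparameters by lowest MSE and refits on the full outer training split, and the outer loop collects RMSE and R² per fold.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVR

X, y = make_regression(n_samples=150, n_features=20, noise=10.0, random_state=0)
param_grid = {"C": [0.1, 1.0, 10.0], "epsilon": [0.05, 0.1]}   # illustrative grid

outer = KFold(n_splits=5, shuffle=True, random_state=0)        # outer 5-fold loop
r2s, rmses = [], []
for train_idx, test_idx in outer.split(X):
    # Inner 5-fold grid search on the training split; best = lowest MSE
    inner = GridSearchCV(SVR(kernel="rbf"), param_grid,
                         scoring="neg_mean_squared_error", cv=5)
    inner.fit(X[train_idx], y[train_idx])   # refits best model on full train split
    pred = inner.predict(X[test_idx])
    r2s.append(r2_score(y[test_idx], pred))
    rmses.append(np.sqrt(mean_squared_error(y[test_idx], pred)))

mean_r2, mean_rmse = np.mean(r2s), np.mean(rmses)  # averaged across outer folds
```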
Explaining Predictions Using SHAP
Using the SHAP library, we visualized the genes that contributed to the predictions made by the best models. For the SVR model, which demonstrated the best performance in terms of R² across all rounds of cross-validation for each dataset, the SHAP scores were calculated using 100 test samples with the KernelExplainer function. The results were plotted using the summary_plot function.
Enrichment Analysis for Top SHAP Score Genes
Enrichment analysis was performed to explore the biological functions of the target gene set.

Establishment of REGA
Using the R Shiny app, we developed a tool named REGA to visualize the SHAP score output from the MSBB and ROSMAP cohorts. By selecting a dataset and genes, users can observe the distribution of SHAP scores for the selected genes and plot the SHAP scores for individual samples, indicating the contribution of the selected genes to resilience score predictions.

Results
Superior Performance of SVR in Predicting Resilience Score in the MSBB Cohort
First, we constructed models to predict resilience scores using the MSBB cohort and compared their prediction accuracies on the test dataset. We selected features based on the top 1000, 2000, 3000, 4000, and 5000 Pearson correlation coefficients with the resilience score. For training, we used 5-fold cross-validation and evaluated the models based on the average RMSE and R² values. The SVR model exhibited the best performance across all the feature sets. In particular, the SVR model with 3000 features exhibited the best performance, with an average RMSE of 0.621 (standard error, SE 0.0137) and an average R² of 0.614 (SE 0.0137) (Fig. 2, Supplementary Fig. S2, and Supplementary Tables S1 and S2). Following the SVR model, the transformer, XGBoost, and random forest models showed progressively lower performance. The linear model exhibited the lowest prediction accuracy. In addition, no significant differences were observed in the results based on the number of features used.

Superior Performance of SVR and Transformer Models in the ROSMAP Cohort
We performed a similar analysis using the ROSMAP cohort data. In the ROSMAP cohort, the transformer model exhibited the best performance with 1000 and 3000 features, whereas the SVR model exhibited the best performance with 2000, 4000, and 5000 features.
The best overall performance was obtained by the SVR model with 5000 features, which achieved an average RMSE of 0.831 (SE 0.0141) and an average R² of 0.302 (SE 0.0252) (Fig. 3, Supplementary Fig. S3, and Supplementary Tables S3 and S4).

Explanation of the Best Models Through SHAP Scores
In the MSBB and ROSMAP datasets, the SVR model exhibited the best performance (R² = 0.649 with 3000 features in the MSBB dataset and R² = 0.364 with 4000 features in the ROSMAP dataset) (Supplementary Tables S2 and S4). To identify the genes contributing to model predictions, SHAP values were calculated for the best model in each dataset using 100 test samples (Fig. 4). The SHAP values indicate the extent to which each feature (gene) contributes to the output of the model. SHAP scores are based on cooperative game theory, which considers all possible combinations of features to fairly distribute the "payout" among them [41]. Genes such as PKP1, NPY2R, FRG1JP, and ADAMTS2 were identified in the MSBB cohort, whereas YBX2, ADRA1D, TIMM8A, ANGPT2, and SLC6A13 were identified in the ROSMAP cohort.

Enrichment Analysis of Top SHAP Score Genes
The top 500 SHAP score genes from each dataset were extracted, and the genes shared between the datasets were identified, resulting in 39 common genes (Supplementary Fig. S4A).
Enrichment analysis was performed for each dataset. The 39 genes common to the two cohorts showed significant enrichment in the ATP metabolic process and amino sugar catabolic process pathways, with BH-adjusted p-values below 0.05 (0.0478 for each) (Supplementary Fig. S4B and Table 1).

Establishment of REGA
Finally, using the R Shiny application, we developed a tool to visualize the SHAP scores calculated from the best SVR models in both the MSBB and ROSMAP cohorts.This tool is accessible at https://igcore.cloud/GerOmics/REsilienceGeneAnalyzer/.

Discussion
In this study, we constructed predictive models for AD resilience scores using RNA-seq data and machine learning techniques. Our results showed that, compared to the random forest, XGBoost, linear, and transformer-based models, the SVR model had the best performance in the MSBB cohort, whereas both the SVR and transformer-based models exhibited the best performance in the ROSMAP cohort.
Traditional statistical methods in AD resilience research often face challenges, such as low correlation and difficulty in detecting variable genes, particularly in CR research. These approaches may not identify all relevant molecules or capture the nuanced effects of resilience, particularly given the significant differences between pathological and normal conditions. In our study, the machine learning SVR model achieved an R² of 0.649 in the MSBB cohort, demonstrating relatively high accuracy for a regression model. Nevertheless, the prediction accuracy in the ROSMAP cohort was lower, possibly because of differences in RNA-seq quality, batch effects, and variability in resilience scores. Future studies should focus on expanding the datasets to include more accurate clinical and resilience scores, thereby improving model robustness and prediction accuracy.

Transformer-based models, which have recently been reported to achieve high accuracy in cancer classification studies [47], ranked second in this study. This outcome can be attributed to sample size constraints. This study used data from 1247 samples in the MSBB cohort and 1549 samples in the ROSMAP cohort, whereas the previous study used data from 10,340 and 713 samples for 33 cancer types and 23 normal tissues, respectively. Transformer-based models may require larger datasets to effectively learn complex relationships within the data. Notably, however, the transformer model exhibited the best performance in several cases in the ROSMAP cohort, which has the larger sample size.

SHAP [41] analysis identified genes contributing to the resilience score predictions in the best-performing models for both the MSBB and ROSMAP cohorts. Many of these genes have been previously associated with AD [48][49][50][51][52][53][54][55]. For example, somatostatin (SST) has been linked to abnormal neuropeptide levels, impaired function, and hyperactivity of SST-positive interneurons (SST-IN) co-localized with amyloid-β plaques in AD [49,50].
Recent single-nucleus RNA-seq studies have reported that SST neurons are associated with cognitive resilience [12].Additionally, enrichment analysis of 39 common genes revealed enrichment in ATP metabolic and amino sugar catabolic process pathways.
Previous studies have also reported that amino acid metabolism pathways are associated with CR [21]. Additionally, amyloid positivity without cognitive impairment has been associated with the preservation of youthful brain aerobic glycolysis [56].
Table 1. Top 10 pathways from the GOBP, Reactome, Hallmark, and KEGG databases based on q-values for the combined dataset, as well as for the MSBB and ROSMAP cohorts individually.

Furthermore, we developed an SVR-based resilience score prediction tool called REGA. The SVR model provided the best results for both the ROSMAP and MSBB cohorts. This tool allows users to analyze how well the SVR model explains resilience based on one or more genes detected in either cohort. It is anticipated that this tool will provide valuable reference data for analyses and experimental validation across different datasets.

This study has several limitations, with sample size being a critical factor. Recently, transformer-based models have shown high prediction accuracy in various cases compared to existing machine learning models. However, because of their complexity, transformers require large sample sizes to achieve high accuracy. In this study, the limited sample size led to the identification of SVR as the best model. Reconstructing the models with a larger sample size may reveal different models with better accuracy, and increasing the sample size is expected to enhance overall prediction accuracy. Second, the robust measurement of resilience scores is challenging. Although the same explanatory variables (Braak stage and CERAD score) were used in both the ROSMAP and MSBB cohorts, different target variables (MMSE and CDR scores) were used to construct the linear models and calculate the resilience scores. This methodological difference may be one reason for the lower prediction accuracy in the ROSMAP cohort. The different methods of calculating the resilience scores also prevented data integration and cross-validation of the models between cohorts. Finally, biological and technical variations in RNA-seq data, such as batch effects, sequencing depth, and normalization methods, can introduce variability. Machine learning models are susceptible to these factors; therefore, appropriate data preprocessing and normalization methods are essential to ensure accurate predictions.

Fig. 1. Overview of the data analysis performed in this study. RNA-seq datasets from the

Fig. 2. Prediction performances of different machine learning models using various feature

Fig. 3. Prediction performances of different machine learning models using various feature

Fig. 4. SHAP values of the best models for predicting resilience scores in the (A) MSBB and

Fig. 5. Visualization of gene contribution to the prediction of resilience scores. The REGA

Mayo and Michael J Fox foundations, the Arizona Department of Health Services, and the Arizona Biomedical Research Commission. We extend our gratitude to the participants of the Religious Order Study and Memory and Aging projects for their generous donations. Additionally, we thank the Sun Health Research Institute Brain and Body Donation Program, Mayo Clinic Brain Bank, and Mount Sinai/JJ Peters VA Medical Center NIH Brain and Tissue Repository. Data and analysis contributors include Nilüfer