Enhancing Neurocognitive Outcome Prediction in Congenital Heart Disease Patients: The Role of Brain Age Biomarkers and Beyond

This paper aimed to investigate the predictive power of combining demographic, socioeconomic, and genetic factors with a brain MRI-based quantified measure of accelerated brain aging (referred to as deltaAGE) for neurocognitive outcomes in adolescents and young adults with Congenital Heart Disease (CHD). Our hypothesis posited that including the brain age biomarker (deltaAGE) would enhance neurocognitive outcome predictions compared to models excluding it. We conducted comprehensive analyses, including leave-one-subject-out and leave-one-group-out cross-validation techniques. Our results demonstrated that the inclusion of deltaAGE consistently improved prediction performance when considering the Pearson correlation coefficient, a preferable metric for this study. Notably, the deltaAGE -augmented models consistently outperformed those without deltaAGE across all cross-validation setups, and these correlations were statistically significant (p-value < 0.05). Therefore, our hypothesis that incorporating the brain-age biomarker alongside demographic, socioeconomic, and genetic factors enhances neurocognitive outcome predictions in adolescents and young adults with CHD is supported by the findings.

Predicting neurocognitive outcomes is an urgent unmet need. In the USA, more than 30,000 infants per year survive CHD but half of them develop neurocognitive impairments in adulthood [8][9][10][11][12][13][14][15][16][17] . Intervention before adulthood may improve outcomes 21 , by promoting parent-child relationships, individual psycho-education, outreach to community healthcare providers 22 , and home-administered computerized training programs 23,24 . However, the bottleneck problem is: how to identify CHD patients at high risk for neurocognitive impairments. High-risk patients should be ideal candidates for interventions, while interventions on low-risk patients should be avoided. Current work inferring adulthood neurocognitive outcomes are very few and mostly used clinical variables or traditional brain MRI metrics, i.e., volumes 14,25 . However, they explain only 1/3 of the neurocognitive outcomes [26][27][28] , insufficient to support interventional clinical trials.
Early prediction of later-life neurocognitive outcomes will create a precious time window for early intervention 29,30 . It will identify high-risk patients for targeted intervention, avoiding unnecessary interventions for patients at low risk for future neurocognitive impairment 31 . Both the early and the targeted interventions are key unmet needs in clinical trials that aim to improve CHD patients' long-term neurocognitive outcomes 21,30 . Very few existing studies use brain magnetic resonance images (MRIs), demographics, socioeconomic status (SES), or genetic factors to predict later-life neurocognitive outcomes [32][33][34][35][36][37][38] . In this paper, we test our overall hypothesis that combining demographics, SES, or genetic factors, and adding a brain MRI-based quantified severity of accelerated brain aging, can better predict neurocognitive outcomes than without the brain age biomarker.

Neurocognitive Score Prediction
To test the hypothesis that combining brain MRI-based quantified severity of accelerated brain aging to demographics, SES, and genetic factors can better predict neurocognitive scores than without the brain-age biomarker, we train and validate each of our predictive models in leave-one-sample-out as well as leave-one-group-out cross-validation in two stages. In the first stage, we use all features including the brain-age biomarker, and in the second stage, we use all features except the brain-age biomarker.

Estimation of Accelerated Brain Aging
In a recent study 39 , we trained a deep learning brain age estimator on T1-weighted brain MRIs of 16,705 healthy brain MRIs 40 and produced the prediction of brain age for 96 adolescents and young adult survivors of CHD (accessed from the same PCGC dataset as in this study; 8-30 years of age). We computed the severity of brain aging by subtracting deep model-estimated and the actual chronological ages (deltaAGE; we refer to it as the MRI-based brain-age biomarker henceforth). Using a T-test with normal controls, we confirmed the existence and severity of accelerated brain aging (i.e., deltaAGE > 0 with p-value < 0.05).

Prediction by Regression Forests
We used a regression forest that ensembles regression decisions from 5-layered (i.e., depth) 100 decision trees (see Fig. 1(a)). As described earlier, in the first stage of leave-one-sample-out and leave-one-group-out cross-validation, we used deltaAGE along with other features mentioned in Table 1 to predict neurocognitive scores mentioned in Table 2. Note that some of the features or target scores are not available for many patients and that is why, we did not include those features and target scores in the present study (see Tables 1 and 2). In the second stage of leave-one-sample-out and leave-one-group-out cross-validation, we train the regression forests again from scratch using the same features as used in stage one, except deltaAGE, to predict neurocognitive scores. Also note that in both stages, we train our regression forest from scratch for each of the neurocognitive scores separately.

Prediction by DNNs
We also used two 5-layered DNNs of different widths (i.e., different numbers of nodes in the hidden layers) in leave-onesample-out and leave-one-group-out cross-validation setups. Like regression forests, in the first stage, we used deltaAGE along with other features mentioned in Table 1 to train DNNs to predict neurocognitive scores mentioned in Table 2. In the second stage, we train those DNNs again from scratch using the same features as used in stage one, except deltaAGE, to predict neurocognitive scores. We used the mean absolute error loss function to train these DNNs defined as: where g and p are the ground truth and the predicted test scores, respectively, and m denotes the total number of training data in a batch. We chose the Adam optimizer with a learning rate of 0.001 to train both DNNs. We also chose a batch size of 16. We implemented our models in PyTorch version 1.6.0 and Python version 3.8.10. The training was performed on the E2 cluster of Boston Children's Hospital using an Intel E5-2650 v4 Broadwell 2.2 GHz processor, an Nvidia Titan RTX GPU with 24 GB of VRAM, and 8 GB of RAM.

Metrics for Prediction Accuracy Evaluation
To evaluate the neurocognition prediction accuracy, we used the Pearson correlation coefficient (r), mean absolute error (MAE), and mean absolute percentage error (MAPE) between the predicted and the ground-truth scores. The Pearson correlation between paired datasets (G, P) : {(g 1 , p 1 ), (g 2 , p 2 ), ..., (g N , p N )} is mathematically defined as 41 : whereḡ andp are the mean of all data points in the ground truth and predicted scores G and P, respectively, and N denotes the total number of test data. In addition, the MAE metric is defined as: where g and p are the ground truth and the predicted values, respectively, and N denotes the total number of test data. Furthermore, the MAPE metric is defined as:  Table 3. Pearson correlation performance between the ground truth and predicted neurocognitive scores by the regression forest, DNN-1, and DNN-2 in leave-one-sample-out cross-validation. The best correlation value for full-scale IQ is shown in blue font and the best mean correlation value is shown in bold font. Acronyms-w/: with, w/o: without, IQ: intelligent quotient.
where g and p are the ground truth and the predicted values, respectively, and N denotes the total number of test data.

Results
In this section, we provide comparative neurocognitive score prediction performance by the regression forest, DNN-1, and DNN-2 in terms of Pearson correlation, MAE, MAPE, and the Wilcoxon signed-rank in leave-on-sample-out and leave-onegroup-out cross-validation setup. Further, we used the data collection sites as well as the diagnosis, i.e., either CHD or control cohort, as groups. We ultimately compared this performance using two feature sets, once with deltaAGE and again without deltaAGE.

Leave-one-sample-out Performance
In Tables 3 and 4, we show the leave-one-subject-out cross-validated prediction performance by the regression forest, DNN-1, and DNN-2. We show the Pearson correlation coefficient (r) between the actual and predicted neurocognitive test scores in Table 3, where we present two sets of results by the regression forest, DNN-1, and DNN-2 for each neurocognitive test. For one set, we combined the brain-age bio-marker (i.e., deltaAGE) with other features (shown in columns with header 'w/ deltaAGE' in tables), while for another set we did not combine deltaAGE with other features (shown in columns with header 'w/o deltaAGE' in tables). We see in Table 3 that prediction performance by the regression forest is overall better than that by the DNN-1 and DNN-2, and the correlation (r) is statistically significant (for p-value=0.05) for many tests. On the other hand, the correlation between the actual scores and DNN-predicted scores is worse and not statistically significant (for p-value=0.05) for any test. Therefore, relying more on regression forest-based prediction, we further see that the prediction performance is better in the occasion when deltaAGE is combined with other features as confirmed by the higher mean of correlation coefficient (see first column under regression forest in Table 3). We also show the MAE and MAPE performance between the actual and predicted neurocognitive test scores for 'with deltaAGE' and 'without deltaAGE' by the regression forest, DNN-1, and DNN-2 for each neurocognitive test in Table 4. Further, we estimated the difference between the actual and predicted scores for 'with deltaAGE' and 'without deltaAGE' cases followed by the Wilcoxon signed-rank test. The Wilcoxon signed-rank test produces a statistic value of 0 when two distributions perfectly match to each other. Otherwise, the statistic value gets larger as the two distribution gets further away from each other. We see in Table 4 that prediction performance in terms of the MAE and MAPE by the regression forest is overall better than that by the DNN-1 and DNN-2 as depicted by the least mean of MAE and MAPE by the regression forest (see first and second columns under regression forest in Table 4). Further, we see that the Wilcoxon signed-rank statistic value is large between the 'with deltaAGE' and 'without deltaAGE' cases for regression forest, which infer that prediction performance for 'with deltaAGE' is better than that for the 'without deltaAGE' case, although this statistic value of not statistically significant (for p-value=0.05).  Table 5. Pearson correlation performance between the ground truth and predicted neurocognitive scores by the regression forest, DNN-1, and DNN-2 in leave-one-group-out cross-validation. In this table, we use data collection sites as groups. The best correlation value for full-scale IQ is shown in blue font and the best mean correlation value is shown in bold font. Acronyms-w/: with, w/o: without, IQ: intelligent quotient.

Leave-one-group-out Performance
We further tested the neurocognitive score prediction performance by the regression forest, DNN-1, and DNN-2 in leave-onegroup-out cross-validation. We considered two different attributes for two different group-based cross-validation. First, we used data collection sites as groups, and second, diagnosis (i.e., CHD and control cohorts) as groups. In the following sections, we present those findings.

Cohorts of Seven Data Collection Sites as Groups
In Tables 5 and 6, we show the leave-one-group-out cross-validated prediction performance by the regression forest, DNN-1, and DNN-2 for 'with deltaAGE' and 'without deltaAGE' cases. We show the Pearson correlation coefficient (r) between the actual and predicted neurocognitive test scores in Table 5, where see that prediction performance by the regression forest is overall better than that by the DNN-1 and DNN-2, and the correlation (r) statistically significant (for p-value=0.05) for several tests including 'Full-scale IQ.' On the other hand, the correlation between the actual scores and DNN-predicted scores is worse and not statistically significant (for p-value=0.05) for any test. Furthermore, we see that the prediction performance by the regression forest is better for the 'with deltaAGE' than the 'without deltaAGE' case (see first column under 'Regression Forest' in Table 5). We further show the MAE and MAPE performance between the actual and predicted neurocognitive test scores for 'with deltaAGE' and 'without deltaAGE' by the regression forest, DNN-1, and DNN-2 for each neurocognitive test in Table 6 We also estimated the difference between the actual and predicted scores for 'with deltaAGE' and 'without deltaAGE' cases followed by the Wilcoxon signed-rank test. We see in Table 6 that prediction performance in terms of the MAE and MAPE by the regression forest is overall better than that by the DNN-1 and DNN-2 as depicted by the least mean of MAE and MAPE by the regression forest (see first and second columns under 'Regression Forest' in Table 6). Further, we see that the Wilcoxon signed-rank statistic value is large between the 'with deltaAGE' and 'without deltaAGE' cases for regression forest, which infer that prediction performance for 'with deltaAGE' is better than that for the 'without deltaAGE' case, although this statistic value of not statistically significant (for p-value=0.05).

CHD and Control Cohorts as Groups
In Tables 7, 8, and 9, we show the leave-one-group-out cross-validated prediction performance in terms of Pearson correlation coefficient (r) between the actual and predicted neurocognitive test scores by the regression forest, DNN-1, and DNN-2, respectively, for 'with deltaAGE' and 'without deltaAGE' cases. In these tables, we used the CHD and control cohorts as groups. We see in these tables that prediction performance by all the approaches (i.e., regression forest, DNN-1, and DNN-2) are found to be better for 'without deltaAGE' case and when control cohort is used for training and CHD cohort for validation, as depicted by the best mean r in respective tables of regression forests, DNN-1, and DNN-2 (see the second last columns under 'LOGO (Training: Control, Test: CHD)' in Tables 7, 8, and 9). In addition, the prediction performance in terms of r is found to be the best for the DNN-2 approach (see Table 9), although we see correlation performance to be statistically significant (for p-value=0.05) for fewer tests than we found for the regression forest in leave-one-sample-out cross-validations in Table 3, and leave-one-group-out (where, data collection sites as group) cross-validations in Table 5.
We also show the MAE and MAPE performance between the actual and predicted neurocognitive test scores for 'with deltaAGE' and 'without deltaAGE' by the regression forest, DNN-1, and DNN-2 for each neurocognitive test in Tables 10, 11, and 12, respectively. We further estimated the difference between the actual and predicted scores for 'with deltaAGE' and 'without deltaAGE' cases followed by the Wilcoxon signed-rank test and showed the Wilcoxon statistic and associated p-value in Tables 10, 11, and 12. We see in Table 6 that prediction performance in terms of the MAE and MAPE by the regression forest is overall better for 'without deltaAGE' case and when the control cohort is used for training and CHD cohort for validation, as depicted by the least mean MAE and mean MAPE. On the other hand, for DNN-1 and DNN-2, prediction performance in terms of the MAE and MAPE is overall better (i.e., least MAE and MAPE) for 'without deltaAGE' case but when CHD cohort is used for training and control cohort for validation (Tables 11 and 12). Thus, irrespective of the training and validation group, all approaches, i.e., regression forest, DNN-1, and DNN-2, performed better in prediction for the 'without deltaAGE' case. This is the opposite finding of what we found in the leave-one-subject-out and leave-one-group-out (the group being the data collection site) cross-validation setup as seen in Tables 4 and 6. Further, we see that the Wilcoxon signed-rank statistic value is smaller (< 500) between the 'with deltaAGE' and 'without deltaAGE' cases for regression forest, DNN-1, and DNN-2, compared to those (> 1500) for leave-one-subject-out and leave-one-group-out (group being data collection site) cross-validation setup as seen in Tables 4 and 6. It infers that prediction distributions for 'with deltaAGE' and 'without deltaAGE' are closer to each other when the leave-one-group-out setup uses CHD and control cohorts as groups.

Full-scale IQ Prediction Performance
Full-scale IQ is typically used to assess human general intelligence, the fundamental ability that combines all subdomains of neurocognitive abilities 42 Table 8. Pearson correlation performance between the ground truth and predicted neurocognitive scores by the DNN-1 in leave-one-group-out (LOGO) cross-validation. In this table, we used the CHD cohort for training and the control cohort for validation, and vice versa. The best correlation value for full-scale IQ is shown in blue font and the best mean correlation value is shown in bold font. Acronyms-w/: with, w/o: without, IQ: intelligent quotient.

8/14
Tests LOGO  Table 9. Pearson correlation performance between the ground truth and predicted neurocognitive scores by the DNN-2 in leave-one-group-out (LOGO) cross-validation. In this table, we used the CHD cohort for training and the control cohort for validation, and vice versa. The best correlation value for full-scale IQ is shown in blue font and the best mean correlation value is shown in bold font. Acronyms-w/: with, w/o: without, IQ: intelligent quotient.  available with the PCGC dataset we used in this study (see Table 2). However, the full-scale IQ provides an overall quantification of a person's general intelligence. Furthermore, full-scale IQ is not calculated by typical averaging of all subdomain scores, but rather by employing factor load analysis, resulting in heterogenous subdomain contributions to full-scale IQ 42 . Therefore, we also check the effect of brain-age bio-marker in the prediction of the full-scale IQ in this study. In Fig. 2, we show plots of the best predicted full-scale IQ scores in each of the three cross-validation setups. We observe in all three cross-validation setups in Figs. 2(a-c) that the prediction models performed better with statistical significance (p-value=0.05) when the feature set included deltaAGE, although overall performance in terms of mean correlation over all the test sometimes differed as seen Tables 7, 8, and 9. Best prediction (a) by the regression forest in the leave-one-sample-out cross-validation, (b) by the regression forest in leave-one-group-out cross-validation, where a group is the data collection site, and (c) by the DNN-1 in leave-one-group-out cross-validation, where a group being either CHD or control cohort.

Discussion
In this study, we conducted experiments to quantify the effect of brain-age bio-marker in predicting neurocognitive scores in adolescents and young adults with CHD. To perform this test, we employed regression forest and two DNNs on demographic, socioeconomic, and genetic factors. Our findings demonstrate the potential of leveraging MRI-based brain-age bio-marker for predicting neurocognition in patients with CHD. However, this investigation also presented us with several unanswered questions. For example, the size of the training data remains a concern when using deep neural network frameworks, as 10/14 Figure 3. Ditribution of the ground truth scores of different neurocognitive tests.
discussed by Richter et al. 44 . We had only 89 data samples in this study, which may not be sufficient to draw definitive conclusions. Despite these lingering questions, our study reveals that brain-age bio-marker can aid in predicting neurocognitive scores. Several key points arising from our findings warrant further discussion:

Pearson correlation vs. MAE
The Pearson correlation coefficient (r) seems more reliable than the MAE and MAPE metrics in neurocognition prediction accuracy estimation. The distribution of neurocognitive scores in our dataset follows a Gaussian-like distribution (see Fig. 3). As a result, a central tendency of the predicted scores towards the mean results in a low MAE and MAPE, although predictions were often inaccurate. Therefore, we considered r as a better indicator of the accuracy of the predicted IQ scores. In addition, the associated p value indicates the statistical significance of the estimated correlation.

Comparison to Structural MRI-based State-of-the-arts on Smaller Data
Several studies [45][46][47] predicted the full-scale IQ score, which showed a correlation of 30-70% (p < 0.01) between the ground truth and estimated absolute full-scale IQ scores. These studies used a dataset of size less than 250 healthy subjects with an age distribution of 6-27 years. Our dataset consists of 89 patients with an age distribution of 7-30 years, and we achieved the best correlation between the actual and predicted full-scale IQ of 37% (see Table 3 and Fig. 2(a)) but without using structural MRI directly, rather employing demographic, socioeconomic, and genetic factors.

Test of Hypothesis
In this paper, we tested the hypothesis that combining demographics, socioeconomic, or genetic factors, and adding a brain MRI-based quantified severity of accelerated brain aging, can better predict neurocognitive outcomes than without the brain age biomarker. In our results, we observed that the mean correlation coefficient is better for the 'with deltaAGE' case than the 'without deltaAGE' in the leave-one-subject-out and the leave-one-group-out (where the group is the data collection site) as seen in Tables 3 and 5. However, in the leave-one-group-out (where the group is the CHD/control cohort) setup, the mean correlation coefficient is better for the 'without deltaAGE' case than the 'with deltaAGE' (see Tables 7, 8, and 9). Since, 1. The Pearson correlation is preferable to the MAE and MAPE metrics in this study (discussed earlier), 2. The size of the training data remains a concern when using DNN frameworks (discussed earlier), 3. Leave-one-group-out cross-validation with considering CHD/control cohorts as groups forced us to keep out about 50% of the data for validation, resulting in the smallest data size for training among our three cross-validation setup, and 4. Full-scale IQ is not calculated by typical averaging of all subdomain scores, but rather by employing factor load analysis, resulting in heterogenous subdomain contributions to full-scale IQ (discussed earlier),

11/14
We can consider Pearson correlation performance on full-scale IQ prediction by the leave-one-subject-out and the leave-onegroup-out (where the group is the data collection site) more trustworthy than the overall mean correlation. Based on this consideration, we see in Fig. 2 that the Pearson correlation (r) for full-scale IQ prediction is better for 'with deltaAGE' than the 'without deltaAGE' in all three cross-validation setups. In addition, all the correlations between the ground truth and predicted full-scale IQ corresponding to 'with deltaAGE' cases are statistically significant (p-value < 0.05). Thus, our hypothesis that adding brain-age bio-marker to demographic, socioeconomic, and genetic factors in predicting neurocognition in adolescents and young adults with CHD stands true.

Limitations
While our paper presents valuable findings, it is important to acknowledge several limitations. Firstly, our study employed a relatively small sample size of only 89 patients. Increasing the sample size would enhance the statistical power and broaden the generalizability of our results. Secondly, by amalgamating data from both control and CHD groups, we may have introduced confounding variables, thereby restricting our ability to make specific conclusions about each group. Future investigations should contemplate analyzing these groups separately to gain a more precise understanding of the distinct contributions of brain structure to intelligence within each population. Furthermore, our analysis exclusively relied on a single dataset, potentially limiting the applicability of our findings to other populations or imaging protocols. To ensure the robustness of our results, it would be beneficial to validate them using multiple independent datasets. In addition, our study exclusively employed regression forest and DNN architectures and did not explore the potential advantages of utilizing alternative models such as support vector regressors or Vision Transformers (ViTs). Assessing various learning approaches could yield valuable insights and potentially enhance predictive performance. Lastly, our study concentrated solely on demographic, socioeconomic, genetic, and MRI-based brain-age biomarkers, omitting actual MRI features that could contribute to a more comprehensive understanding of the relationship between brain structure and intelligence. Future research should consider integrating these additional data sources to offer a more holistic perspective on intelligence prediction.

Conclusion
In conclusion, this study provided valuable insights into the prediction of neurocognitive outcomes in CHD patients. Our results highlighted the utility of including a brain MRI-based bio-marker, deltaAGE, in predictive models, showing consistent improvements in prediction performance. However, it is essential to acknowledge several limitations, such as the relatively small sample size and the amalgamation of data from both control and CHD groups, potentially introducing confounding variables. Future research should address these limitations by increasing sample sizes, analyzing groups separately, and validating findings with multiple independent datasets. Moreover, exploring alternative machine learning models could offer further improvements in predictive accuracy. Additionally, integrating actual MRI features into the analysis could provide a more comprehensive understanding of the relationship between brain structure and intelligence. Overall, this study contributes to our understanding of neurocognitive prediction in CHD patients and paves the way for further research in this field.