Novel age-dependent cortico-subcortical morphologic interactions predict fluid intelligence: A multi-cohort geometric deep learning study

Brain structure is tightly coupled with brain functions, but it remains unclear how cognition is related to brain morphology, and what is consistent across neurodevelopment. In this work, we developed graph convolutional neural networks (gCNNs) to predict Fluid Intelligence (Gf) from shapes of cortical ribbons and subcortical structures. T1-weighted MRIs from two independent cohorts, the Human Connectome Project (HCP; age: 28.81±3.70) and the Adolescent Brain Cognitive Development Study (ABCD; age: 9.93±0.62) were independently analyzed. Cortical and subcortical surfaces were extracted and modeled as surface meshes. Three gCNNs were trained and evaluated using six-fold nested cross-validation. Overall, combining cortical and subcortical surfaces yielded the best predictions on both HCP (R=0.454) and ABCD datasets (R=0.314), and outperformed the current literature. Across both datasets, the morphometry of the amygdala and hippocampus, along with temporal, parietal and cingulate cortex consistently drove the prediction of Gf, suggesting a novel reframing of the morphometry underlying Gf.


Introduction
Pioneering work by Binet and Simon suggested that individuals with higher intelligence learn more quickly (faster reaction time) and effectively (higher accuracy) across a broad array of tasks, from naming objects to defining words, drawing pictures, and solving analogies (1,2). Spearman synthesized these observations into the hypothesis of a generalized intelligence factor (g) which reflects abstract thinking and includes the ability to acquire knowledge, adapt to novelty, develop abstract models, and benefit from schooling and experience (3,4). Further work by Cattell (5) split g into fluid intelligence (Gf), which is the capacity to solve novel problems and abstract reasoning, and crystallized intelligence (Gc) which relates to accumulated knowledge (6)(7)(8). Although Gc and Gf are related and rapidly develop in childhood until adolescence, Gf reaches its steady state prior to a delayed declination whereas Gc continues growing throughout the lifespan (9,10). Of these, Gf has been shown to positively correlate with a vast number of cognitive activities, and to be an important predictor of both educational and professional success (11), raising questions as to how to optimally predict it in the context of development and senescence for educational and health optimization purposes, and how to do so given the variability of contributions to Gf from different regions of the brain, including subcortical grey matter regions of the medial temporal lobe and basal ganglia that are largely ignored in the literature but are fundamental for working memory which is associated with Gf, and judgment and decision-making which relates to the use of abstract modeling of value (12,13).
Recent work has sought to understand the neural substrates of Gf as one approach to predicting and calibrating it. This work has focused on a broad array of neuroimaging modalities and lesion models, each of which has its limitations. Studies with functional imaging of cognitive tasks, or of synchrony between resting state oscillations in blood-oxygen level dependent (BOLD) signal, have tended to focus on fronto-parietal networks responsible for integrating sensory and executive functions (14) in the form of the parieto-frontal integration theory (P-FIT) (15), or combinations of lesion and imaging work (16) to explore multiple demand (MD) system contributions to Gf (17). In the context of structural brain imaging (i.e., morphometry), studies have evaluated the correlation between brain size and Gf (18), or evaluated the contribution of specific cortical areas and white matter fiber bundles to Gf. This research has identified associations between Gf and cortical morphology such as cortical thickness, cortical area, cortical volume, gyrification and grey matter density (19)(20)(21). Individual cortical morphological metrics have shown rather modest correlations with Gf, and cortical morphometric features correlating with Gf have shown moderate overlap as illustrated by Tadayon and his colleagues. (21). Although this study reports a correlation between the shape of basal ganglia structures and Gf (22), the relative contribution of subcortical structures involved with emotional memory (e.g., amygdala) and judgment and decision making components of the basal ganglia (e.g., nucleus accumbens) have not been investigated, nor have they been evaluated against cortical contributions, which include areas outside of fronto-parietal networks, such as the temporal cortex (23).
Multiple approaches exist for assessing grey matter brain structure: (1) volume (regional volumes of deep grey matter structures) to investigate gross volumetric differences, (2) grey matter density using voxel-based morphometry, (3) shape deformation (surface topology) to investigate localized shape differences. Of these approaches, shape analyses allow for a comparison of surface geometrical properties of structures between groups or with behavior (24) that may not have had an overall volume change or alteration in grey matter density, and thus may be very sensitive to subtle changes in their relationship to behavior, diagnosis and development (25,26).
For instance, exposure-dependent deformations may precede more pronounced volumetric changes with illicit drug exposure (27). Other work has shown that subcortical deformations have been reported in thalami of patients with schizophrenia (28)(29)(30)(31), obsessive-compulsive disorders (31), Parkinson's disease (32), and Tourette's syndrome (33). Of surface geometrical measures, thickness is a topographical measure that is an indicator of the integrity of cytoarchitecture in the cortex (34), and of all the topographical measures that can be made, cortical thickness is the most invariant brain-size parameter across mammalian evolution (35,36).
Neocortical enlargement depends primarily on growth of surface area (35,37,38), which thus makes cortical surface measures important in considering similarities across cohorts with significant differences in mean age, if one is going to identify consistent features of brain morphometry related to Gf.
Given these considerations, and the dearth of research on (i) both deep grey matter contributions and cortical contributions to Gf, (ii) absence of work of what is common across disparate age groups, the focus of our work was three-fold. First, we aimed to identify the most predictive brain morphometric features of Gf. Due to the challenges inherent in modeling all the relevant cortical morphologic features and the limited predictive power of individual cortical morphologic features, we used a datadriven approach capable of identifying complex non-linear relationships, potentially across remote brain regions, and implicitly encompassing multiple morphometric features such as cortical thickness, cortical area and gyrification, as well the shape of subcortical structures. The second aim of our study was to assess the contribution of the subcortical structures to Gf either alone or combined with cortical morphology.
The third aim specifically focused on investigating how age might be involved in the prediction of Gf. For these purposes, we developed a novel geometric deep learning method capable of extracting relevant cortical and subcortical morphological features (39). Our method was data-driven and relied on cortical and subcortical surface models, extracted from automated analysis pipelines, as an input for a graph convolutional neural network (gCNN) to infer Gf. Using a 6-fold cross-validation scheme and two large independent datasets, we evaluated the robustness of our method and the reproducibility of the predictions across two different age groups.
Finally, a gradient-based backpropagation method allowed us to map the most predictive cortical and subcortical regions involved in the correct prediction of Gf. Table 1 summarizes the comparative performance of each of the three proposed models used to predict fluid intelligence on the HCP testing dataset across all six folds. Fig. 1 shows the distribution of predictions for each model. All three models were able to successfully predict fluid intelligence scores. However, use of both cortical and subcortical surfaces together achieved the best performance (MSE = 0.834, R = 0.454, = 6.2 × 10 −57 ), followed by using only the cortical surface data (MSE = 0.886, R = 0.381, = 2.6 × 10 −39 ) with use of subcortical surface data alone producing the least accurate results (MSE = 1.014, R = 0.155, = 2.3 × 10 −8 ).  Interestingly, the overall performance for fluid intelligence prediction was better on the HCP dataset compared with ABCD dataset.

ABCD Dataset Fluid Intelligence Prediction
Since we used six-fold cross-validation, we had six separate testing datasets and in total 15 correlations were calculated respectively for each inner-fold input and two correlations were calculated inter-cohort. More details can be found in Fig. S1. Table   S1 showed the averaged correlations. In each inner-fold, cortical structures showed higher correlations than subcortical structures on both datasets. The correlations on inner-fold inputs in each dataset were higher than those on the inter-cohort correlations across two datasets.

Mapping Interpretation
In order to provide some interpretability to our models' performance, we applied a gradient backpropagation-based visualization method (grad-CAM) to visualize the brain areas most relevant to fluid intelligence prediction. Fig. 3 and Fig. 4 show the average maps of the testing sets from both the HCP and ABCD datasets.  Fig. 4A, B demonstrate that cortical structures play a more significant role than subcortical structures in the prediction of fluid intelligence score, which is in keeping with our statistical results. The topographic distribution of relevant brain structures is largely conserved with particular weight placed on the left temporal and parietal lobes in the prediction across both datasets. Interestingly, the morphology of the left temporal lobe was weighted more heavily in the prediction using the HCP dataset whereas the left parietal lobe was weighted more heavily in the prediction using the ABCD dataset. Other cortical structures including the bilateral paracentral lobules and posterior cingulate gyri were also relevant to the prediction but to a lesser degree. As the results in Table 1 and Table 2 demonstrate, there is more broadly distributed involvement of the subcortical structures in the prediction of fluid intelligence from the ABCD dataset while the subcortical structures are somewhat less involved in prediction from the HCP dataset.
Results from the models using only cortical surface data or only subcortical surface data were similar in distribution but variable in degree when compared to results from the model using both cortical and subcortical surface data together.

Discussion
This study had three aims. These aims were to (a) identify the most predictive brain morphometric features of Gf, (b) assess the contribution of the subcortical morphometry to Gf either alone or combined with cortical morphology, and (c) investigate how age might be involved in the prediction of Gf. Our Analysis utilized a novel deep learning model using residual gCNNs to infer Gf from cortical and subcortical surface models that integrated multiple morphometric features such as cortical thickness, cortical area and gyrification, as well the shape of subcortical structures. Using two large and independent datasets of pre-adolescent (ABCD project) and young adults (HCP dataset) and a nested six-fold cross-validation scheme, this analysis predicted Gf with significant correlations (R=0.31-0.45). Across both datasets, the right amygdala and hippocampus, left temporal and parietal cortex, and bilateral cingulate morphometry consistently drove the prediction of Gf. Given the uniqueness of these findings, particularly for the amygdala and temporal cortex, localization was confirmed using grad-CAM to show reproducibility across subcortical surfaces and gyral folds. Divergence between the datasets was observed whereby the left hippocampus and amygdala, right caudate, right nucleus accumbens (NAc) and right pallidum also helped predict Gf for the younger ABCD cohort, with subcortical structures alone producing an R ~ 0.27, and cortical structures alone producing an R ~ 0.3. Together, subcortical and cortical structures produced an R ~ 0.31 for the younger ABCD cohort, whereas for the older HCP cohort, the combined was higher (R ~ 0.45). For the older HCP cohort, furthermore, subcortical structures alone producing an R ~ 0.1, and cortical structures alone produced an R ~ 0.4, with the right rectus gyrus helping predict Gf for the older HCP cohort but not the ABCD cohort. In both datasets, significantly better predictions were thus obtained by combining the cortical and subcortical surfaces. Despite this commonality regarding improved Gf prediction through combining cortical and subcortical morphometric features, the younger ABCD cohort had a substantially larger contribution from medial temporal and basal ganglia brain regions than the older HCP cohort, which needs to be considered in the context of differences in the trajectory of brain development between the late latency/pre-adolescence (ABCD cohort) and late adolescence/young adulthood (HCP cohort).

Predictive models of Gf
Fluid intelligence refers to the ability to solve novel reasoning problems, which is considered independent of experience and education and, as such, believed to be biologically grounded on neurodevelopment (40). Previous studies have reported an age-related performance in Gf, peaking in late adolescence and declining in adulthood (41). In this study, we included two datasets of subjects at distinct phases of cognitive maturation. The younger cohort, the ABCD dataset, included children from 9 to 11 years, an age at which fluid intelligence has not yet reached its putative maximum. In this cohort, we predicted Gf with R= 0.328, which, to our knowledge, further improves the prediction accuracy so far reported using this dataset (42)(43)(44)(45)(46). Using Kernel Ridge Regression classifiers and CNNs, Mihalik and colleagues used manually extracted voxel-wise brain features (as opposed to automated morphometric analysis) on the ABCD dataset and predicted residualized Gf with an R = 0.17 (43), while Li and colleagues XGBoost classifiers on brain volumes and cortical curvatures to predict Gf with an R = 0.18 (47). The current work substantially builds on these ground-breaking reports, while also identifying brain region features, such as amygdala shape, not previously reported to be involved with Gf.
More studies have attempted to predict fluid intelligence using the HCP dataset.
The age of subjects in the HCP dataset ranges from 22 to 35 years old, which corresponds to a different maturational phase when fluid intelligence is close to its full potential (48). All previous studies predicting fluid intelligence in the HCP dataset have done so using functional MRI (fMRI) (49)(50)(51)(52). For instance, using functional connectivity analysis of task-based fMRI (FC), Greene and colleagues reached an R = 0.17 (53). Combining FC with resting-state fMRI (rs-fMRI), Elliott and colleagues obtained an R = 0.325 (54). Jiang and colleagues integrated multi-task FC features, applying partial least square regression method to improve the accuracy to an R = 0.409 (55). The current work compares favorably with these previously reported state-of-the-art functional imaging methods, by achieving an R = 0.454, using T1 weighted anatomic MRI data without any behavioral or functional imaging data. Our results support an association between brain morphometry and Gf (21).
Moreover, we found that this association was strengthened when both cortical and subcortical structures' shapes informed our gCNNs, underpinning the interdependencies across remote brain regions that in our review of the literature has not been reported.

Cortical and subcortical regions involved in the prediction of Gf
The degree of involvement of temporal, parietal, and cingulate cortex, as revealed by grad-CAM, was highly reproducible across folds and displayed remarkable similarities between the two independent datasets. Specific cortical regions for both datasets included the left posterior middle and inferior temporal gyri as well as left basal temporal cortex, left temporo-parietal junction at the posterior aspect of the Sylvian fissure, left posterior cingulate, left interhemispheric paracentral lobule and the right cingulate region. At the cortical level, the only differentiator between the two datasets was the right rectus gyrus, whose morphometry predicted Gf in the HCP dataset but not in the ABCD dataset. These morphometric findings of temporal, parietal, and cingulate cortex add complexity to current frameworks for Gf that focus on fronto-parietal networks involved with combining sensory and executive material (14) as with parieto-frontal integration theory (P-FIT) (15,17). The fact that morphometric features of the temporal, parietal, and cingulate cortex were observed in two independent cohorts raises many issues about integrating such structural imaging findings with functional imaging findings that focus on other regions of the prefrontal cortex, particularly as the current structural findings have as strong a predictive outcome as any study to date using functional imaging data, and question the current focus of many studies of Gf just on fronto-parietal networks.
Prior work of the neuroanatomic substrate of Gf has identified associations between widespread cortical areas, but relatively few relationships were reported with subcortical structures. The subcortical structure that appears to have had the most associations with Gf is the hippocampus. Raz and colleagues reported smaller hippocampal volume being associated with Gf (56) while Amat and colleagues reported smaller hippocampal volume being associated with full-scale intelligence quotient (IQ) and IQ subscales (57). Others reported hippocampal volume predicting Gf only in musically trained people (58), and the volumes of hippocampal subfields being more relevant for Gf than working memory (59), even though working memory has been linked to Gf (11). The current findings support the prior work, particularly in the context that increased Gf prediction resulted when subcortical regions such as the hippocampus were combined with cortical regions; this work resembles but does not replicate others who have reported that rs-fMRI connectivity between the right hippocampus and medial prefrontal cortex was associated with Gf (60). The current work further indicates how important it is to consider hippocampus morphometry in the context of the morphometry of other subcortical regions, particularly those with minimal association to Gf in the literature, that also have been linked to other cognitive science literatures, such as reward processing in judgement and decisionmaking and emotion regulation (e.g., nucleus accumbens and amygdala).
It appears a relatively smaller number of studies have linked Gf to morphometric measures of regions of the basal ganglia, such as the caudate and NAc (22), or suggested that Gf can be segregated from Gc based on NAc volume (61). The current work adds to these studies by indicating the bilateral NAc was important for predicting Gf in latency stage individuals (ABCD cohort), but contributes to the prediction of Gf to a lesser extent than cortex in late adolescents/young adults (HCP cohort). The NAc has been a fundamental target of addiction research (62), social reward studies (63,64) and neuroeconomics (65), with a consensus sentiment that it is a core region for the judgement of value, that is fundamental for decision-making (66,67). In this context, the NAc has also been considered important for allocation of effort, as with effortful cognitive tasks and motivation (12), and has been implicated in "grit" or the ability to persevere in a motivated fashion under adversity (68). The NAc is a critical target of dopaminergic cells in the brainstem (67), that make it important for motivated behavior, and suggest it would be important for allocating effort to the solution of novel reasoning problems that define Gf.
Related to the function of motivation, and heavily interconnected with the NAc (12) the amygdala has been considered a core region for emotion regulation, such as the experience and control of fear (69). To date, we cannot find any studies in the literature that implicate the amygdala with Gf, despite multiple studies implicating other regions with Gf that are contiguous with the amygdala (e.g., hippocampus) or significantly interconnected with it (e.g., NAc). Gf has been implicated with connectivity related to the uncinate fasciculus, a white matter bundle that connects the amygdala and anterior temporal cortex with frontal regions (70), but not directly connected to amygdala morphometry. The current findings across two independent cohorts of amygdala morphometry predicting Gf, might be consistent with a role in emotion regulation facilitating the solution of novel problems and adapting learning to new circumstances.
In parallel with considering the location of morphometric changes observed in this study, it is important to consider the complexity involved with morphometry as a field, including the number of independent features measured by voxel-based morphometry, cortical thickness, and volumetrics (19,23,26,27,71,72). The analysis of the specific contributions of cortical thickness, cortical area and gyrification to Gf can reveal large topologic variations depending on the cortical morphometry employed, and resulting in sometimes contradictory results that suggest limitations to the specificity of each measurement individually (21,73,74). Using a data-driven approach which is agnostic to the individual morphologic features of the brain's shape, the approach used in this study identified robust and well-localized involvement of both cortical and subcortical regions. Although the exact nature of the inferred morphometric features is not known using this approach, the network has the ability to identify interactions across individual morphologic features including cortical thickness, cortical area and gyrification, as well as to integrate features related to the shape of subcortical structures in its learning process. It can also take into account subtle and non-linear inter-regional interactions that contribute significantly to an individual's Gf. Multiple brain regions previously reported in the literature using individual morphologic feature analysis were not revealed to play a role in the prediction of Gf using the current approach. One explanation for this is that our method integrates multi-dimensional interactions across individual morphologic features into its prediction, and the mapped results identified the most relevant brain regions taking these interactions into account.

Differences in topographic prediction of Gf across age groups
Gf increases rapidly from birth through late adolescence, when it reaches a plateau which is sustained through the third decade of life, followed by a slow decay over the remaining lifespan (75). This trajectory parallels that of grey matter pruning in the cortex, which is much more pronounced in latency-aged children (e.g., ABCD cohort) relative to young adults (HCP cohort). Throughout adolescence, a strong relationship between cortical and subcortical development has been noted with cognitive performance (76,77). Stress and emotional strain from adverse familial, educational, and social events over childhood and adolescence can also modulate the rate of growth in Gf (78). One might thus expect larger inter-subject variability in a younger population when Gf is still in its developmental phase rather than in a young adult population when its changes is asymptotic. Our results could be consistent with this interpretation in that we achieved a higher R in predicting Gf for late adolescents/young adults (HCP cohort) relative to latency stage/pre-adolescent children (ABCD cohort). At the same time, the cortical brain regions involved in the prediction of Gf remained consistent across age groups as revealed by grad-CAM visualization, despite the differences in predictive accuracy. Two other issues also should be noted. Namely, that neurodevelopment impacts the capacity to modulate cognitive behaviors important for Gf (79,80). Furthermore, subjects from the HCP dataset were all healthy adults while the ABCD dataset included children with a broad array of risk factors for developing mental health and addictive disorders, which can impact Gf (81,82) Differences in the discrepancy in accuracy across datasets likely represents contributions from a combination of the brain's developmental trajectory as well as potential cognitive vulnerabilities across the health spectrum.
Between these two cohorts, subcortical structures played a more prominent role in the prediction of Gf in latency-stage/pre-adolescent children than in young adults/late adolescents. Across both cohorts, only the head of the right hippocampus and the right amygdala consistently contributed to the prediction of Gf. For the younger subjects (ABCD cohort), the left hippocampus and amygdala were also important for the prediction of Gf, along with the right caudate, NAc, and pallidum.
The observation of bilateral hippocampi with the ABCD cohort is consistent with suggestions that working memory may be particularly important for Gf in children (83). In the developing brain, associations between fluid reasoning and subcortical shape has been reported to be widespread, encompassing the bilateral putamen, pallidum and caudate (84), consistent with the current findings. Our findings involving the right striatum are in keeping with other reports of asymmetric rightsided striatal dominance in younger individuals compared to older individuals (85).
Lastly, it needs to be noted that medial temporal structures and the striatum have strong connections to frontal and cingulate cortices (86), and corticostriatal circuits (87). Through such connections medial temporal structures and the striatum have been implicated with executive function (88), context coding (89) and impulse control (90), which are important processes for adaptation to novelty with Gf.

Datasets
Brain MRI and neurocognitive data from two publicly available datasets were used independently in this work: the Human Connectome Project (HCP) S1200 data release and the Adolescent Brain Cognitive Development Study (ABCD) 2.0 release (82,91). The HCP dataset consists of neurobehavioral measurements and MRI scans from healthy subjects aged between 22 to 35 years. Subjects were defined as "healthy" in the absence of diagnosed neurologic or psychological conditions. All subjects were scanned on a custom Siemens 3T Connectome Skyra at Washington University using a standard 32-channel Siemens head coil. Further details pertaining to the included subjects, data collection parameters and preprocessing steps can be found on the HCP website (82,92). The ABCD dataset consists of neurobehavioral measurements and MRI scans from over 11,000 children aged between 9 and 11 years. Subjects from across the United States with diverse health, socioeconomic and ethnic backgrounds were included. Brain MRI data were acquired from three different 3T scanner platforms: Siemens Prisma, General Electric 750 and Phillips. Further details pertaining to the included subjects, data collection parameters and preprocessing steps can be found on the ABCD website (91). Minimally preprocessed T1-weighted MRI scans were obtained from both databases.
In addition to brain MRI data, Gf scores, measured by the NIH Toolbox Neurocognition battery were collected. Specifically, the nihtbx_fluidcomp_uncorrected variable was included from the ABCD dataset and the Test, and the Toolbox Pattern Comparison Processing Speed Test). The raw Gf scores from two datasets were quantile normalized at first in order to assume the Gaussian distribution of each dataset. Quantile normalization was realized by sorting the scores of each subject from low to high and replacing them with a random standard Gaussian distribution which was also sorted from low to high. The characteristics of two datasets are summarized in Table 3.

MRI Data Preprocessing
For each subject, inner cortical surfaces (modeling the interface between grey and white matter) and outer cortical surfaces (modeling the cerebrospinal fluid/grey matter interface) were extracted using Freesurfer (v6.0, https://surfer.nmr.mgh.harvard.edu).
Seven subcortical structures per hemisphere were automatically segmented using Freesurfer (amygdala, nucleus accumbens, caudate, hippocampus, pallidum, putamen, thalamus) and then modeled into surface meshes using SPHARM-PDM (https://www.nitrc.org/projects/spharm-pdm/). All surfaces were inflated, parameterized and registered to a corresponding surface template using a rigid-body registration to preserve the anatomy of the cortex and subcortical structures (94). No morphometric evaluation of subcortical structures, re-segmentation, or use of multiple atlases was performed; this study sought to minimize variance from analysis of the feature set used for prediction.
Surface templates were converted to graphs based on their triangulation scheme.
Nodes of the graphs were surface vertices, and edges of the graphs were segments across vertices. Overall, the graphs including all structures had 47,616 nodes, 32,768 from the cortical surfaces and 14,848 from the subcortical surfaces. Input features of the network were defined as the Cartesian coordinates of surface vertices in subjects' native space resampled into the surface templates. As a consequence, cortical nodes were assigned 6 features (X, Y, Z of both the inner and outer cortical surface vertices) and subcortical nodes had 3 features (X, Y, Z of subcortical surface vertices) when they were used for separate training.
More details about the construction of the common graphs and the organization of input features are provided in Supplemental material and Fig. S1-4. All subjects were represented using the same underlying graphs, the features assigned to the nodes were unique to each subject and were the input of our gCNNs.

Preparation of the datasets
The nested cross-validation was used in this work to assess generalizability, which contains an outer loop of six folds and an inner loop of five folds. Both datasets were split into six folds, randomly selecting one set as the outer test set and the rest five sets as the outer training set. This whole process repeats six times for each fold.
The outer training set was sub-divided into five folds, including one validation set and four inner training sets. This inner process repeated five times and the outer test set was evaluated by an ensembled model averaged from those five trained models. More details are shown in Fig. S2. For HCP dataset, we included 1,097 subjects, i.e., in each fold, 914 inner training sets and 183 outer test sets and for ABCD dataset, we included 8,070 subjects, i.e., 6,725 inner training sets and 1,345 outer test set.

Graph Convolution
Traditional CNNs extract features on structured data, such as 2D images or 3D volumes. The convolution operations over graphs can be generalized in the spectral domain, which is the multiplication of the signal on graphs with the eigenvector matrix of the graph Laplacian (95). We defined un undirected graph = { , ϵ, }, where edge ϵ connects two vertices V, with | | = , and ∈ × is the weighted adjacency matrix. is a square symmetric matrix and is the weight where ∈ is a vector of Fourier coefficient.
In order to largely reduce the computational complexity, we approximate spectral filters using truncated expansions of Chebyshev polynomials (95). The -localized filtering operation is defined as

Image Augmentation
In order to increase the generalizability between our two datasets, we used two data augmentation techniques: random rotation within ±20 degrees and a random Gaussian noise standardized with mean = 0 and a standard deviation of = 0.02. The augmentation parameter denotes the probability of the use of data augmentation for a single subject. In this study, both datasets had = 0.5, indicating that data augmentation was applied with a 50% probability per subject for each iteration of training.

Network Architecture
Fig . 5 shows the details of our graph networks. We used residual blocks in our model to facilitate the training of deeper networks inspired by (96). Using this approach, the output of the previous block is added to the output of the current block to avoid the vanishing gradient problem (97). Our model contains a pre-convolutional layer (Pre-Conv), four residual blocks (ResBlock) and a post residual block, followed by a single fully connected (Fc) layer with one output that reflects the estimated Gf score. Each residual block has two subblocks, including a batch normalization layer (BN), a non-linear activation function ReLU and a normal convolutional layer (Conv). The maxpooling layer is used after each residual block to downsample the features.

Loss function
The loss function used in our model is composed of three parts: the mean square , where 1 , 2 are the regularization parameters, is the predicted score, is the true label, is the covariance and is the standard deviation. This correlation term is added to alleviate the "regression towards or to the mean (RTM)" bias, where the higher the correlation, the lower the loss (98).

Grad-CAM Visualization
To visualize the most relevant brain areas involved in the network's decision making process and to provide some interpretability to our network results, a graphical gradient-weighted Class Activation Map (grad-CAM) method was applied to generate a color-coded probability map (99). Grad-CAM uses the gradient information flowing back to the last convolutional layer of the model to generate heatmaps highlighting important regions upon which the model focuses and then performs a global average pooling operation to produce the importance weights ∈ × , of each neuron: where refers to the score of class , i.e. = 1 in this paper, and represents the value at each node for feature map activations on the last convolutional layer. After getting the weights, is calculated with a weighted combination of feature maps followed by an activation function , which is applied to only take positive weights into consideration and ignore the negative weights since we were interested in the features with positive influence on the class of interest .
Grad-CAM maps were obtained for Gf prediction from each testing set in all six folds. In our models we used four pooling layers, reducing the number of nodes by a factor 2 4 . Therefore, we rescaled the generated grad-CAM maps back to the original size using spherical linear interpolation on the cortical and subcortical surfaces in order to overlay on the original graphs.

Network implementation
Model performance was evaluated using nested cross-validation, with the datasets split into six folds each, where each fold was randomly chosen for testing and the remaining five folds were used for training. In each fold, the outer test dataset was evaluated by an averaged model ensembled from all five inner-folds models, and the grad-CAMs were generated using the average weighted sum on each of the testing subject. More details are shown in Fig. S2.
For the ABCD dataset, the networks were trained using a batch size of 32, and a maximum number of epochs of 100. We used Adam optimizer with a learning rate of 0.0005 and a decay rate of 0.99 per ten steps, the parameters 1 and 2 were both set to 0.0001 and the dropout ratio at the fully connected layer was set to 0.5. For the HCP dataset, the batch size was set to 50 and the parameter 1 was set to 0.0005. Due to the smaller dataset size, the maximum number of epochs for the HCP dataset was set to 80. The different network parameters were optimized using cross-validation and the training process stopped when the generalization error increased with the patience factor set to 5. The networks were implemented in Python (version 3.6) using TensorFlow and trained on a single GPU (Nvidia GeForce 2080Ti) workstation.

Statistical Analysis
The performance of three types of gCNNs were evaluated, using either: 1) only the inner and outer cortical surface nodes, 2) only the subcortical surface nodes, or 3) both inner and outer cortical surface and subcortical surface nodes together. The MSE, Pearson correlation coefficient score (R) and training time yielded for each testing fold and for each complete dataset were calculated. A paired t-test was performed to compare the performance of each of the three input types and the pvalues were adjusted for multiple comparisons using false discovery rate (FDR), which was considered as statistically significant if the p-values < 0.05. The normalized correlation (0 to 1) was calculated on the mapping results ( ) to compare the inter-fold similarity and inter-cohort similarity on the HCP and ABCD datasets.
The closer to 1, the higher the correlation between the two groups. All statistical analysis was performed using the Scikit-learn and Numpy packages in Python (version 3.6).

List of Supplementary Materials
Materials and Methods

Conversion of meshes to graphs
Template surface meshes were converted to graphs using their triangulation schemes (Fig. S3). Nodes of the graphs were defined as the surface vertices, and edges of the graphs were triangles segments across vertices. A weight , was assigned to the edge between neighbor nodes and such as: with , is the Euclidean distance between the nodes and .

Hierarchical decomposition of the graphs
High-resolution template graphs underwent a hierarchical dichotomic partitioning using the following steps and illustrated in Fig. S4: 1. For each structure, define the root leaf as the graph corresponding to the highdensity mesh 2. Repeat these steps until the average distance across neighbor leaves is less than a threshold a. Given a set of leaves at level , partition each leaf at level into two child leaves at level + 1 using spectral clustering (100). The new leaves are therefore subgraphs of their parent leaf; b. Order the 2 +1 leaves of level + 1 such that the leaves 2( − 1) and 2 are partitions of leaf at level ; c. For each leaf of level + 1, identify its center node as the node whose betweenness centrality is largest (101); d. Defines the partition neighbor matrix +1 , of size 2 +1 × 2 +1 , such that √2 where , is the geodesic distance (along the mesh) between the center of the leaf i and the center of the leaf j at level + 1; e. Compute the average distance across leaves' centers and exit the loop if less than , otherwise continue to step a.
The average distance threshold was set to 3 mm for the cortical surfaces and 2 mm for the subcortical surfaces. Therefore, the number of decomposition levels for each structure was a function of their surface area. The number of nodes at the finest level is provided in Table S2.

Aggregation of matrices and features
Depending on the structures used for the prediction, the matrices were block concatenated along the main diagonal to define the whole underlying graph (Fig. S5).
When only the subcortical structures were used for prediction, the input features associated to the nodes of the graph were the 3-dimensional Cartesian coordinates of the corresponding center node in individuals' space [ , , ]. If the cortex only was used to feed the gCNN, then a 6-dimensional vector was assigned to each node of the graph, containing the Cartesian coordinates of the inner and outer cortical surfaces [ , , , , , ]. Finally, when both the subcortical structures and the cortex were used to feed the gCNN, a 9-dimensional vector was assigned to each node of the graph. For nodes of the subcortical structures, the first 3 elements of the input were Cartesian coordinates of the nodes, and the last 6 were zeros. For cortical nodes, the first 3 elements of the vector were zeros, and the last 6 the Cartesian coordinates of the inner and outer cortical surfaces.

Pooling operation
By construction of the hierarchical decomposition of the graph and ordering of the node, the pooling operator is applied similarly to a 1-dimensional signal with a stride of 2 and a pooling size of 2. This is conceptually identical to the original gCNN study (95). However, as opposed to the METIS algorithm initially proposed, our approach guaranties that no singleton is ever generated.