A machine learning tool for interpreting differences in cognition using brain features

Predicting variability in cognition traits is an attractive and challenging area of research, where different approaches and datasets have been implemented with mixed results. Some powerful Machine Learning algorithms employed before are difficult to interpret, while other algorithms are easy to interpret but might not be as powerful. To improve understanding of individual cognitive differences in humans, we make use of the most recent developments in Machine Learning in which powerful prediction models can be interpreted with confidence. We used neuroimaging data and a variety of behavioural, cognitive, affective and health measures from 905 people obtained from the Human Connectome Project (HCP). As a main contribution of this paper, we show how one could interpret the neuroanatomical basis of cognition, with recent methods which we believe are not yet fully explored in the field. By reducing neuroimages to a well characterised set of features generated from surface-based morphometry and cortical myelin estimates, we make the interpretation of such models easier as each feature is self-explanatory. The code used in this tool is available in a public repository: https://github.com/tjiagoM/interpreting-cognition-paper-2019


Introduction
Predicting variability in cognitive functioning can have important consequences in terms of delineating the life trajectory of a person. Individual differences in cognition have been related to important outcome measures like education or occupational achievement, general health, longevity, and risk to develop dementia. One way of predicting cognitive performances at the single subject level is to use different sources of neuroimaging data that assess different aspects of brain function and structure (anatomy). Resting-state functional connectivity [11] [3] or neuro-anatomically extracted features [13] are examples of such sources of neuroimaging data.
However, no one in this field so far has tried to capitalise on multi-modal neuroimaging, i.e. on the possibility to predict individual differences in cognition by simultaneously using different type of information regarding the brain structure or function. In this study, we combined different structural neuroimaging modalities (T1-weighted and T1/T2-derived intracortical myelin estimates) to predict subject-specific scores in a series of cognitive and demographic measures derived from a data-reduction analyses of a large set of behavioural measures, drawn from the Human Connectome Project (HCP).
Machine Learning has recently gained attention in a number of applied biomedical disciplines due to the increased prediction power it can provide. In specifically, neural networks and deep learning methods are the most well-known but they usually bring a lot of complexity when one wants to interpret the results or avoid overfitting [15]. Instead, we focused our attention in a type of gradient boosting decision tree algorithm, XGBoost [1], which have recently achieved top results in applied machine learning competitions using tabular data. As it is built on top of decision trees, it also has the advantage that there is space to try to understand how the model makes its decisions.
To interpret the neuroanatomical basis of cognitive measures, we use recently developed algorithms that improve the interpretability capacity of machine learning models without neglecting their prediction power. In specific, we used an adaptation of SHAP (SHapley Additive exPlanations) [9] which is a unified framework for interpreting predictions without falling in common mistakes that do not bring consistent and true interpretations [10] .
Interpretation of 3D neuroimages can be complex, and usually relies on visual plots and specific neuroanatomical knowledge for interpretation. In this paper, we reduced the different structural neuroimaging modalities by using surfacebased morphometry and parcellated cortical myelin estimates as subject-wise features in XGBoost. By reducing multi-modal neuroimaging measures to a well characterised set of features, we make the interpretation of such models easier as each feature is self-explanatory per se. For example, when interpreting machine learning models that use 3D images as a basis, it is possible to visually identify areas of the brain responsible for the prediction, but it is not obvious whether those regions are related, for instance, to actual values of thickness or surface area.
The main contribution of this paper is thus to show how one could interpret the neuroanatomical basis of cognition, by applying state-of-the-art machine learning methods which we believe are not yet fully and correctly explored in the field. The code used in the resulting tool is publicly available: https://github.com/tjiagoM/interpreting-cognition-paper-2019 2 Methods

Dataset and Participants
Pre-processed structural Magnetic Resonance Images as well as demographic and cognitive data from 1200 subjects were obtained from the Human Connectome Project (HCP) public repository 5 .
After accounting for missing information, the total number of subjects included in our analysis was 905, where around 53% were females, and 47% were males. All participants were young and healthy adults, with a median age of 29, with no hypertension, alcohol misuse, panic disorder, depression, or other psychiatric and neurologic disorder, or history of childhood conduct problems. The majority of people were right-handed white americans with a non-hispanic or latino background (cf. Table 1).

Data and Image Processing
HCP structural T1w images were collected from a 3-Tesla Siemens Skyra unit (housed at Washington University in St. Louis) using an axial T1-weighted sequence (TR = 2400 ms, TE = 2.14 ms, flip angle=8, voxel-size 0.7 0.7 0.7 mm3). The T1w data was passed through the full Freesurfer (v. 5.3) 6 reconstruction stream to calculate cortical thickness, surface area, number of vertices in the cortex, gray volume, integrated rectified mean curvature, integrated rectified gaussian curvature, folding index, and intrinsic curvature index. The optimal pipeline used to obtain these segmentation is described in detail in a previous article [6]. To map all subjects' brains to a common space, namely the Desikan-Killiany Atlas [2], reconstructed surfaces were registered to an average cortical surface atlas released by HCP 7 using a non-linear procedure that optimally aligned sulcal and gyral features across subjects [4].
Myelin maps were generated by the HCP consortium according to Glasser and Van Essen (2011) [5]. Transformation of myelin map data from the individual subject's native mesh to the right fsaverage template (LR) standard mesh involves two deformation maps, one representing registration from the native mesh to the fsaverage left mesh (L) and fsaverage right mesh (R) and another representing registration between L and R and LR. The two deformation maps were concatenated into a single deformation map using Caret software that was applied to the individual subject's myelin map data, cortical thickness data, and surface curvature data. The individual myelin map data were normalised to a group global mean and then averaged at each surface node in order to achieve anatomical correspondence to other structural features.

Factor Analysis
We extracted 9 factors which explained 70% of cumulative variance using Principal Component Analysis (PCA). Hence, twenty-three measures of affect, cognition, and health including self-report questionnaires, neuropsychological tasks, and other behavioural indices from the HCP database were grouped into independent and latent components of emotional behaviour, affect, psychological well-being, cognitive functioning, and general health using factor analysis (FA). Subjectspecific loading values were employed as dependent variables. In Table 2 it is possible to see what each factor is representing.

Cross-Validation Procedure
To predict each factor using our generated features, we implemented a Nested Cross-Validation (CV) procedure as depicted in Figure 1, with 5 outer folds, and 3 inner folds. Essentially, the data is divided in 5 folds, and selecting each fold  once, the other four folds are selected to run an inner loop. For each inner loop the data is divided in 3 folds, where 2 of these folds are used for feature selection and hyperparameter search, selecting the hyperparameters that yield the best Mean Squared Error (MSE) in the remaining fold. After running this process for each one of the 3 folds, the model that yield the lowest mean MSE is selected and used in the fold selected in that outer loop. In other words, this process selects 5 different models for each outer fold.
We use XGBoost [1] as our machine learning method to predict our factors. XGBoost is a effective, scalable gradient boosting tree algorithm, used worldwide in machine learning and data science competitions to achieve the best predicting results in tabular data. It uses several decision trees (models) as an ensemble added together to make the final prediction. It is called gradient boosting because it uses a gradient descent algorithm to minimise the loss when a new tree/method is added.
Regarding the Feature Selection step, in each iteration we selected the 450 values that share more mutual information with the factor being predicted. We used the scikit-learn [12] implementation as originally described in Kraskov et al. [8] and Ross' [14] papers.

Feature Importance Analysis
Given the complexity of the XGBoost, and the reported issues when understanding how the many decision trees are used for prediction, we used the recent SHAP [10] (SHapley Additive exPlanations), a unified framework for interpreting predictions. Specifically, we used the recent adaptation for tree ensemble methods [9]. SHAP solves reports that popular feature attribution methods are inconsistent and incapable of reporting the true impact of features in tree ensemble methods like XGBoost. Table 3 summarises the results of the nested CV procedure over the 9 factors, by reporting the averaged Mean Squared Errors (MSE) and Pearson-r correlations between actual and predicted values, with the respective standard deviations. In this Section we will only report results on Factors 1, 2, 3 and 7 as the remaining factors either had a high MSE standard deviation or Pearson-r correlation too close to zero. Figure 2 summarises another powerful interpretation that one can get from the SHAP method when applied to a prediction model. Each feature is assigned a SHAP value which represents both the magnitude and direction of its contribution. Its absolute value indicates how powerful that feature is to the prediction, and its signal expresses whether it contributes towards a higher or lower predicted value.  The median rank of brain features when predicting Fluid Intelligence is not very consistent across the outer folds, as it is possible to see in Table 4 where most of the features identified were ranked, on average, below 10, and the Median Absolute Deviation is quite high as well. Some of the brain areas identified are situated around the temporal lobes (middle, inferior and entorhinal/medial, as well as parahippocampal cortex), in which surface areas and gray volumes were more important to predict Fluid Intelligence. Some other areas around the brain were selected, representing not only gray volumes but also folding indexes and myelin values. The left hemisphere is selected much more times than the right hemisphere. As one can see from the most important features in one outer fold when predicting Fluid Intelligence (cf. Figure 2a), generally a higher value of that feature contributes to a higher value of Fluid Intelligence, with a clear exception of the myelin of the Post-Central and Entorhinal area.

Results
The most important features selected across the five outer folds for the prediction of Visual Memory are much more consistent across the outer folds, as the median ranks of the 10 most important features are below 20 (cf. Table 5) and generally the Median Absolute Deviation is below 10. For this factor the most important features are clearly the folding index of the insula and the gray volume of the entorhinal cortex. Frontal lobes play an important role, by having features being selected for the superior frontal, caudal middle frontal, rostral middle frontal, lateral orbitofrontal, and pars opercularis cortex regions, mostly with cortical thicknesses selected. In contrast with Fluid Intelligence, the right hemisphere is selected much more times to predict Visual Memory, which is consistent with previous findings linking the right hemisphere to visual memory [7], although the gray volumes of the entorhinal region were selected in both hemispheres.
As one can see from Figure 2b, when most of these features have a low value, the predicted Visual Memory will be higher. There are some exceptions, but an highlight is the behaviour of the most important feature in terms of median rank Prediction of the sex-related factor yielded not only the best pearson-r correlation values, but the most consistent median ranks across the outer folds, with the 10 most important features always being in top 15 (cf. Table 6). Insula plays a big role in distinguishing a male-or female-like brain, specifically with regards to its gray volume in both hemispheres, as well as the myelin in the left hemisphere. After the insula, the frontal lobes are more prominent, with the presence of features extracted from the medial orbitofrontal cortex, pre-central and rostral middle frontal areas. There is a clear dominance of the left hemisphere in predicting this factor. Table 6: Median rank of the ten most important features when predicting the Sex-related factor. We considered features that appeared at least in 4 folds. MAD stands for Median Absolute Deviation

Brain Feature
Median Rank (MAD) r insula grayvol 1 (0) l insula grayvol 3 (1) l medialorbitofrontal grayvol 3 (2) r inferiorparietal area 4 (1) l temporalpole meancurv 8 (1) l insula myel 8 (5) l transversetemporal myel 11 (2) l precentral myel 12 (5) l rostralmiddlefrontal grayvol 13 (8) r precentral area 13 (11) From Figure 2c it is possible to find a clear trend in which a lower feature value help in a lower Sex-related factor or, in other words, to a brain looking more like a female one. It is also possible to see that the gray volumes of the insula in both hemispheres have a higher SHAP value, in magnitude, than any other feature which, adding to its median rank, shows how important that feature is in distinguishing male-and female-like brains.
Finally, the median ranks of the 10 most important features when predicting Linguistic Skills are not very consistent across the outer folds, but not as inconsistent as when predicting Fluid Intelligence (cf. Table 7). The presence of regions from the frontal lobes is significant, namely the pre-central, rostral anterior cingulate, frontal pole, and caudal anterior cingulate regions. However, the most significant regions are around the temporal lobes: superior temporal, transverse temporal, lingual, parahippocampal, and middle temporal. As opposed to the other factors analysed, there is no clear dominance of one hemisphere over the other.
Differently from the other factors, the direction of contribution of each feature in predicting Linguistic Skills is not in one major direction (cf. Figure 2d).

Conclusion
We aimed to predict individual differences in cognitive functioning only by using brain surface-based morphometry and cortical myelin estimates. Although this is a notoriously challenging problem, our regression model yielded good performance (Pearson-r correlation of almost 0.8) in predicting a sex-related factor, and almost a satisfactory performance (Pearson-r correlation around 0.11) in predicting fluid intelligence, visual memory and linguistic skills (cf. Table 3). These results highlight the potentiality of these type of features being used to predict cognitive performances. We could not achieve fair performance or stability in predicting Executive Functions, Sustained Attention, Waiting Impulsivity, Verbal Episodic Memory and Visuo-spatial Skills, hinting that structural neuroimages alone cannot predict these cognitive factors, and other types of data are probably needed.
We consider that the approach described in this paper shows that the usage of specific, well-defined and self-explanatory features can help in the quick identification of meaningful regions of interest that are important to interpret machine learning methods, avoiding visual interpretation of neuroimages, which do not convey as much meaning as the features we extracted. From our analysis it was possible to see that while some regions of interest appear more often, others do not appear so often thus suggesting that they might not be as important for understanding general cognition. As well, some specific features seem to not be important as they never appeared in the lists of most important features (eg. number of vertices, integrated rectified gaussian curvature, and intrinsic curvature index). It was also interesting that some factors had a clear difference of which hemisphere contributes more.
The significance of these results are supported in the fact that we used HCP, which is not only one of the most homogeneous and well characterised open datasets for healthy subjects, but also has an impressive number of people to be analysed (around a thousand), bringing more credibility to the results we presented.
We hope our work could serve as inspiration for researchers in the field to better study differences of cognition using well-implemented machine learning models that can be easily and correctly interpreted. gaussian curvature), foldind (folding index), curvind (intrinsic curvature index), and myel (myelin estimate).