Multidimensional analysis and detection of informative features in diffusion MRI measurements of human white matter

The white matter contains long-range connections between different brain regions, and the organization of these connections holds important implications for brain function in health and disease. Tractometry uses diffusion-weighted magnetic resonance imaging (dMRI) data to quantify tissue properties (e.g., fractional anisotropy (FA), mean diffusivity (MD), etc.) along the trajectories of these connections [1]. Statistical inference from tractometry usually either (a) averages these quantities along the length of each bundle in each individual, or (b) performs analysis point-by-point along each bundle, with group comparisons or regression models computed separately for each point along every one of the bundles. These approaches are limited in their sensitivity, in the former case, or in their statistical power, in the latter. In the present work, we developed a method based on the sparse group lasso (SGL) [2] that takes into account tissue properties measured along all of the bundles, and selects informative features by enforcing sparsity, not only at the level of individual bundles, but also across the entire set of bundles and all of the measured tissue properties. The sparsity penalties for each of these constraints are identified using a nested cross-validation scheme that guards against over-fitting and simultaneously identifies the correct level of sparsity. We demonstrate the accuracy of the method in two settings: i) In a classification setting, patients with amyotrophic lateral sclerosis (ALS) are accurately distinguished from matched controls [3]. Furthermore, SGL automatically identifies FA in the corticospinal tract as important for this classification, correctly finding the parts of the white matter known to be affected by the disease. ii) In a regression setting, dMRI is used to accurately predict "brain age" [4, 5].
In this case, the weights are distributed throughout the white matter, indicating that many different regions of the white matter change with development and contribute to the prediction of age. Thus, SGL makes it possible to leverage the multivariate relationship between diffusion properties measured along multiple bundles to make accurate predictions of subject characteristics, while simultaneously discovering the most relevant features of the white matter for the characteristic of interest.

Diffusion-weighted Magnetic Resonance Imaging (dMRI) provides a unique view into the physical properties of the connections that comprise the brain white matter. While the measurements are usually conducted with voxels at the millimeter scale, water molecules within each voxel diffuse with characteristic lengths at the micrometer scale, providing aggregate information about the physical structure of the white matter, including the density of axons and distribution of fiber orientations within each voxel [6]. Even though metrics derived from diffusion measurements are ambiguous in terms of their underlying biological interpretation [7], analyzing the variance in these properties has proven useful in characterizing individual differences in cognitive function, characterizing differences between populations, and detecting brain abnormalities associated with disease [8].

To relate the diffusion in each voxel to the macro-structure of long-range connections between different brain regions, methods for computational tract-tracing from diffusion MRI, or tractography, combine the estimates of fiber orientations in each voxel to form streamlines that traverse the volume of the white matter [9,10]. These methods have been under increased scrutiny and several lines of investigation have raised critiques of their validity [11,12]. On the other hand, there have been efforts to shore up the inferences made with these methods [13-18]. Importantly, though discovering novel tracts requires extraordinary evidence, and delineating the exact cortical termination of the streamlines in the gray matter is still prone to error, there is little dispute that tractography can accurately define the location of several major white matter tracts that are known to exist within the core of the white matter [11,19].
Leveraging this fact, one of the most powerful methods currently available to put macro- and micro-structure together is tractometry: assessment of the physical properties of the white matter along specific tracts [20]. Though there are several different available implementations of this overall idea, the principles are similar [1,21-23]: tractometry begins by delineating the parts of the white matter that belong to different major "tracts" (i.e. anatomical or functional groups of white matter fibers), such as the corticospinal tract or arcuate fasciculus, assigning tractography-generated streamlines to "bundles," which approximate the anatomical tracts, and sampling biophysical properties (such as fractional anisotropy or mean diffusivity) along the length of these bundles. In some previous tractometry-based studies, tissue properties along the length of each tract were summarized by taking the mean along each bundle, but there is a large body of evidence showing that there is systematic variability in the values of diffusion metrics along the trajectory of each bundle. This justifies retaining the individual samples along the length of each bundle [1,23,24]. Several analysis strategies are then possible:

1. Point-by-point analysis along each bundle, which requires controlling the family-wise error across the different possible comparisons [25,26].

2. Region of interest (ROI)-based approaches: An alternative that circumvents the multiple comparison problem is to select just a few tracts to compare in each individual, or even to focus on particular segments of these tracts based on a priori hypotheses.
This approach is very powerful when the biological basis of the process of interest is relatively well understood (for a recent example, see [27]).

3. ROI-based selection, followed by multivariate analysis: Here, an ROI is selected based on a priori knowledge, and all the nodes or voxels in the ROI are used together to fit a model that can predict differences between individuals. An example of this is the "profilometry" framework, in which different diffusion metrics from a single tract are combined together to provide input to a multivariate analysis of covariance and linear discriminant analysis [28].

Generally speaking, analysis methods should balance predictive accuracy with descriptive power [29,30]. Accordingly, tractometry analysis should simultaneously capitalize on all the data across all tracts to make the best possible prediction, while also retaining and elucidating spatial information about the locations that are most informative for a prediction. In the present work, we developed a novel framework for analysis of tractometry data that simultaneously selects the features for analysis and fits a model to these features. We use a linear modeling approach, which aims to predict phenotypical variance in a group of subjects based on a linear combination of the features estimated with tractometry.

Using this approach, we first need to deal with the large and asymmetric dimensionality of the data: tractometry data usually has many more features (i.e., measurements at each node, along each bundle, for each diffusion metric) than samples (i.e., subjects). One common solution is to use sparsity-inducing regularization [32]. This tends to shrink to zero the contributions of many of the features, providing results that are both accurate and interpretable. When additional structure is available in the organization of the data, regularization algorithms can take advantage of this structure. For example, if the features lend themselves to a natural division into different groups, the group lasso (GL) can be used to select groups of features, rather than individual features [33]. The Sparse Group Lasso (SGL) elaborates on this idea by providing control both of group sparsity, as well as overall sparsity of the solutions [34].

A second dataset was acquired with b = 1000 s/mm². These data were acquired using a dual spin echo sequence, in which there is sufficient time for eddy currents to subside between the application of the gradients and the image acquisition, so no eddy current correction was applied, but motion correction was applied before fitting the diffusion tensor model [35] in every voxel using a robust fit [36]. We will refer to this dataset as WH.

Data from both of these studies were processed in a similar manner, using the Matlab Automated Fiber Quantification toolbox (AFQ) [1]: streamlines representing fascicles of white matter tracts were generated using a deterministic tractography algorithm that follows the principal diffusion direction of the diffusion tensor in each voxel (STT) [37]. Major tracts were identified using multiple criteria: streamlines are selected as candidates for inclusion in a bundle of streamlines that represents a tract if they pass through known inclusion ROIs and do not pass through exclusion ROIs [38]. In addition, a probabilistic atlas is used to exclude streamlines which are unlikely to be part of a tract [39]. Each streamline is resampled to 100 nodes and the robust mean at each location is calculated by estimating the 3D covariance of the location of each node and excluding streamlines that are more than 5 standard deviations from the mean location in any node. Finally, a bundle profile of tissue properties in each bundle was created by interpolating the value of MRI maps of these tissue properties to the location of the nodes of the resampled streamlines designated to each bundle. At each of the 100 nodes, the values are summed across streamlines, weighting the contribution of each streamline by the inverse of the Mahalanobis distance of the node from the average of that node across streamlines. This means that streamlines that are more representative of the tract contribute more to the bundle profile, relative to streamlines that are on the edge of the tract.
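The Mahalanobis-distance weighting described above can be sketched in a few lines of numpy. This is an illustrative sketch, not AFQ's implementation: the function name, the covariance regularization `eps`, the omission of the 5-SD streamline exclusion step, and the normalization of weights to a weighted mean are all our assumptions.

```python
import numpy as np

def bundle_profile(node_locations, node_values, eps=1e-6):
    """Weighted bundle profile (illustrative sketch, not AFQ's code).

    node_locations: (S, N, 3) array of coordinates for S streamlines
    resampled to N nodes; node_values: (S, N) tissue-property samples.
    At each node, streamlines closer to the bundle core (in Mahalanobis
    distance) receive higher weight.
    """
    S, N, _ = node_locations.shape
    profile = np.empty(N)
    for n in range(N):
        X = node_locations[:, n, :]                 # (S, 3) positions at node n
        mu = X.mean(axis=0)
        cov = np.cov(X, rowvar=False) + eps * np.eye(3)  # regularized 3D covariance
        icov = np.linalg.inv(cov)
        d = X - mu
        maha = np.sqrt(np.einsum("ij,jk,ik->i", d, icov, d))  # per-streamline distance
        w = 1.0 / (maha + eps)                      # inverse-distance weights
        w /= w.sum()                                # normalize (assumed here)
        profile[n] = w @ node_values[:, n]          # weighted mean at this node
    return profile
```

With identical streamlines the weights become uniform and the profile reduces to the plain mean, which is the expected limiting behavior.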

This process creates bundle profiles, in which diffusion measures are quantified and averaged along twenty major fiber tracts. Here, we use only the mean diffusivity (MD) and the fractional anisotropy (FA) of the diffusion tensor, but additional dMRI-based maps or maps based on other quantitative MRI measurements can also be used. These bundle profiles, along with the phenotypical data we wish to explain or predict, form the input to the SGL algorithm. In a domain-agnostic machine learning context, the phenotypical data comprise the target variables while the bundle profiles form the feature or predictor variables (see Fig 1).
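The resulting feature layout (20 bundles × 2 metrics × 100 nodes per subject) can be sketched as follows; the subject count and the random placeholder values are arbitrary assumptions for illustration.

```python
import numpy as np

# Dimensions following the text: 20 bundles, 2 metrics (FA, MD),
# 100 nodes per bundle; the number of subjects here is arbitrary.
n_subjects, n_bundles, n_metrics, n_nodes = 48, 20, 2, 100

# profiles[s, b, m, :] would hold the bundle profile of metric m in
# bundle b for subject s (random placeholders stand in for real data).
rng = np.random.default_rng(42)
profiles = rng.normal(size=(n_subjects, n_bundles, n_metrics, n_nodes))

# Flatten to the feature matrix X: one row per subject, one column per
# (bundle, metric, node) combination.
X = profiles.reshape(n_subjects, -1)

# Record the group (bundle-metric combination) of each column; the
# sparse group lasso penalty operates on these groups.
groups = np.repeat(np.arange(n_bundles * n_metrics), n_nodes)

print(X.shape)        # (48, 4000)
print(groups.shape)   # (4000,)
```

Each of the 40 groups then contains exactly 100 columns, one per node.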

Sparse Group Lasso
Before fitting a model to the data, imputation and standardization are performed.

Missing node values (e.g., in cases where AFQ designates a node as not-a-number) are imputed via linear interpolation. Care is taken not to interpolate across the boundaries between different bundles. Some diffusion metrics will have naturally larger variance than others and may therefore dominate the objective function, making the SGL estimator unable to learn from the lower-variance metrics. For example, fractional anisotropy (FA) is bounded between zero and one and could be overwhelmed by an unscaled higher-variance metric like the mean diffusivity (MD). To prevent this, we remove each feature's mean and scale it to unit variance (z-score) using the StandardScaler from scikit-learn [40]. Scaling is performed separately within each cross-validation split's training and testing data to prevent leakage of information between the testing and training sets [41].

After scaling and imputation, the tractometry data and target phenotypical data can be organized in a linear model:

    y = X β + ε,    (1)

where y is the phenotype: categorical, such as a clinical diagnosis, or continuous numerical, such as the subject's age. The tractometry data are represented in the feature matrix X, with rows corresponding to different subjects and columns corresponding to diffusion measures at different nodes within each bundle. The relationship between tractometric features and the phenotypic target is characterized by the coefficients in β. The error term ε is an unobserved random variable that captures the error in the model. We denote our prediction of the target phenotype as ŷ and the coefficients that produce this prediction as β̂, so that without the error term,

    ŷ = X β̂.    (2)

In general, the feature matrix X has many more columns (features) than rows (subjects), even in large studies (such as the Human Connectome Project, where 1,200 subjects were measured [42]); the problem is therefore ill-posed: the high dimensionality of this data requires regularization to avoid overfitting and to generate interpretable results.
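The two preprocessing steps above can be sketched as follows. This is an illustrative pattern, not AFQ-Insight's code: the helper name `impute_within_bundles` is ours, edge-node gaps are not handled, and fitting the scaler on the training split before applying it to the held-out split is one common way to realize the no-leakage requirement.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

def impute_within_bundles(X, groups):
    """Linearly interpolate NaN nodes bundle by bundle, so that values
    are never interpolated across bundle boundaries.

    X: (subjects, features); groups: group label of each column."""
    X = X.copy()
    for g in np.unique(groups):
        cols = np.where(groups == g)[0]
        block = pd.DataFrame(X[:, cols])
        # linear interpolation along the node axis (interior gaps only)
        X[:, cols] = block.interpolate(axis=1).to_numpy()
    return X

# Toy data: 4 bundles of 10 nodes each, one missing interior node.
rng = np.random.default_rng(0)
groups = np.repeat(np.arange(4), 10)
X = rng.normal(size=(30, 40))
X[0, 3] = np.nan
X = impute_within_bundles(X, groups)
y = rng.normal(size=30)

# z-scoring: statistics come from the training split only, then are
# applied to the held-out split, avoiding train/test leakage.
for train, test in KFold(n_splits=5).split(X):
    scaler = StandardScaler().fit(X[train])
    X_train, X_test = scaler.transform(X[train]), scaler.transform(X[test])
```

After scaling, each training-split feature has mean zero and unit variance, so FA and MD contribute on comparable scales.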

Here, we propose that in addition to regularizing the coefficients in β̂, we can also capitalize on our knowledge of the group structure of the bundle profile features in X. The bundle-metric combinations form a natural grouping. For example, we expect that MD features within the left arcuate fasciculus will co-vary across individuals; likewise for FA values within the right corticospinal tract (CST), and so on. This group structure is represented in Fig 1. We would therefore like a regularization approach that selects a sparse set of these groups while also encouraging sparsity within the selected groups. The sparse group lasso (SGL) is a regularization technique that satisfies exactly these criteria [2]. It solves for a coefficient vector β̂ that satisfies

    β̂ = argmin_β { ½ ||y − X β||₂² + λ₁ Σ_{ℓ=1}^{G} √p_ℓ ||β^(ℓ)||₂ + λ₂ ||β||₁ },    (3)

where G is the number of groups, X^(ℓ) is the submatrix of X corresponding to group ℓ, β^(ℓ) is the coefficient vector for group ℓ, and p_ℓ is the length of β^(ℓ). In the tractometry setting, G = T × M and ∀ℓ: p_ℓ = 100. The first term is the mean squared error loss, L_mse, as in the standard linear regression framework. The second and third terms encourage groupwise sparsity and overall sparsity, respectively. If λ₁ = 0 and λ₂ = 1, the SGL reduces to the traditional lasso [43]. Conversely, if λ₁ = 1 and λ₂ = 0, the SGL reduces to the group lasso [44].

Often, the target variable y is not in the domain in which the linear model can best be fit to it. Equation (2) can be slightly modified as:

    f⁻¹(ŷ) = X β̂,    (4)

where the transformation function f⁻¹ characterizes the transform applied to the data before fitting the linear coefficients. An often-used example is the logarithmic transform; in this case, the model is parametrized by one additional fit parameter, n. For binary classification, the mean squared error loss function in Eq (3) is replaced with the cross-entropy loss, which is the negative log-likelihood of the SGL classifier giving the "true" label.

Implementation, cross-validation and metaparameter optimization

For given values of λ₁ and λ₂, the cost function in Eq (3) can be optimized using proximal gradient descent methods [45], here implemented as a custom proximal operator that is then optimized using the C-OPT library [46]. This supplies an estimate of the optimal β̂ given a particular set of values for the meta-parameters λ₁ and λ₂.
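To make the optimization concrete, the following is a minimal numpy sketch of proximal gradient descent for an SGL-type penalty (elementwise soft-thresholding for the L1 part composed with groupwise shrinkage for the group part). It is an illustration under our own naming and a fixed step size, not the custom proximal operator and C-OPT machinery used in the actual implementation.

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sgl_prox(beta, groups, lam1, lam2, step):
    """Proximal operator of lam1 * sum_g sqrt(p_g) ||beta_g||_2 + lam2 * ||beta||_1,
    applied with step size `step`: soft-threshold first (L1 part), then
    shrink or zero out each group (group-lasso part)."""
    out = soft_threshold(beta, step * lam2)
    for g in np.unique(groups):
        idx = groups == g
        norm = np.linalg.norm(out[idx])
        thresh = step * lam1 * np.sqrt(idx.sum())
        if norm <= thresh:
            out[idx] = 0.0                    # the entire group is zeroed out
        else:
            out[idx] *= 1.0 - thresh / norm   # the group survives, shrunk
    return out

def sgl_fit(X, y, groups, lam1, lam2, n_iter=500):
    """Proximal gradient descent on 0.5 * ||y - X beta||^2 / n + SGL penalty."""
    n, p = X.shape
    beta = np.zeros(p)
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n)  # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        beta = sgl_prox(beta - step * grad, groups, lam1, lam2, step)
    return beta
```

Setting lam1 = 0 recovers lasso-style behavior and lam2 = 0 recovers group-lasso behavior, mirroring the limits of Eq (3).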

To objectively evaluate the model and guard against over-fitting, we used a nested cross-validation scheme: subjects (i.e. rows of the feature matrix X in Fig 1 and Eq (1)) are shuffled and then decomposed into k batches, hereafter referred to as folds. For the ALS dataset we used k = 10 and for the WH dataset k = 5. For each unique fold, we hold that fold out as the test_outer set and let the remaining data comprise the train_outer set, with the subscript indicating the depth of the nested decomposition. We further decompose each train_outer set into three folds and, again, for each unique fold, we hold out that fold as the test_inner set and let the remaining data comprise the train_inner set. At level 1 of the decomposition, we fit an SGL model using fixed regularization meta-parameters λ₁ and λ₂, training the model using train_inner and evaluating the fit on test_inner. We find the optimal values for λ₁ and λ₂ using hyperoptimization, implemented using the hyperopt library's fmin function [47] with a tree of Parzen estimators search algorithm [48]. For continuous numerical y, fmin searches for meta-parameter values that minimize the median absolute error. This could also be done by minimizing the root mean squared error (RMSE) or by maximizing the coefficient of determination (R²). For binary categorical y, fmin seeks to maximize the classification accuracy. This could also be done by maximizing the area under the receiver operating characteristic curve (ROC AUC) or the average precision. Using hyperoptimization, we find optimal regularization parameters and β̂ for each train_outer set and then use those to predict values for data in test_outer. Thus, each subject in the dataset has a predicted phenotype derived from a model that never saw that subject's data.
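The nested scheme can be sketched as follows. To keep the example self-contained, a Lasso estimator stands in for SGL and a small grid search stands in for hyperopt's TPE search; the data, fold counts, and candidate values are toy assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import median_absolute_error
from sklearn.model_selection import KFold

# Toy data standing in for the tractometry features and phenotype.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 30))
y = X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=60)

candidate_alphas = [0.01, 0.1, 1.0]   # stand-in for hyperopt's search space
outer_predictions = np.empty_like(y)

# Outer loop: held-out folds are used only for final evaluation.
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Inner loop: choose the regularization strength on the outer training
    # set only, minimizing median absolute error across 3 inner folds.
    inner_scores = []
    for alpha in candidate_alphas:
        errs = []
        for itrain, itest in KFold(n_splits=3, shuffle=True, random_state=0).split(X[train]):
            model = Lasso(alpha=alpha).fit(X[train][itrain], y[train][itrain])
            errs.append(median_absolute_error(y[train][itest],
                                              model.predict(X[train][itest])))
        inner_scores.append(np.mean(errs))
    best_alpha = candidate_alphas[int(np.argmin(inner_scores))]
    # Refit on the full outer training set, predict the held-out fold.
    model = Lasso(alpha=best_alpha).fit(X[train], y[train])
    outer_predictions[test] = model.predict(X[test])
```

Every entry of `outer_predictions` comes from a model whose hyperparameters and coefficients never saw that subject, which is the property the nested design guarantees.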

The above procedure describes k-fold cross-validation. In fact, we use repeated k-fold cross-validation on the outer level of the decomposition, so that the input data is decomposed into k folds three times. Thus, each subject has three predicted phenotypes. We then take the mean predicted value to summarize the performance of the model. In the classification case, the shuffling into folds is stratified, such that each fold has a population that preserves the percentage of each class found in the larger input data.
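Both splitting schemes are available off the shelf in scikit-learn; the toy label vector below is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold, RepeatedStratifiedKFold

y_class = np.array([0] * 30 + [1] * 18)   # toy imbalanced labels

# Regression case: k folds, repeated 3 times, so each subject is held
# out exactly once per repeat (3 predictions per subject in total).
rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)
counts = np.zeros(len(y_class))
for _, test in rkf.split(np.zeros((len(y_class), 1))):
    counts[test] += 1
# counts is now 3 for every subject

# Classification case: stratified shuffling preserves the class
# proportions of the full sample within every fold.
rskf = RepeatedStratifiedKFold(n_splits=3, n_repeats=3, random_state=0)
for _, test in rskf.split(np.zeros((len(y_class), 1)), y_class):
    fold_fraction = y_class[test].mean()   # matches the overall fraction 18/48
```

Averaging the three per-subject predictions then summarizes model performance, as described in the text.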

Software implementation
The full software implementation of the SGL approach presented here is available in a Python software package called AFQ-Insight, which is developed publicly on GitHub. AFQ-Insight reads feature data that has been processed by AFQ from comma-separated value (CSV) files conforming to the AFQ-Browser data format [49] and represents them internally as DataFrame objects from the pandas Python library [50]. For each set of hyperparameters, we solve the SGL using a custom proximal operator supplied to the C-OPT library [46]. Appropriate hyperparameters are found using the hyperopt library [47].
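The long-to-wide reshaping implied by this data flow can be sketched with pandas. The miniature CSV below is a stand-in: its column names (`subjectID`, `tractID`, `nodeID`, metric columns) follow the general shape of AFQ-Browser-style node files but are not an exact specification of that format.

```python
import io
import pandas as pd

# Miniature stand-in for a nodes CSV: one row per subject/tract/node,
# one column per diffusion metric (illustrative column names).
csv = io.StringIO(
    "subjectID,tractID,nodeID,fa,md\n"
    "s1,CST_L,0,0.51,0.80\n"
    "s1,CST_L,1,0.53,0.79\n"
    "s2,CST_L,0,0.47,0.85\n"
    "s2,CST_L,1,0.49,0.84\n"
)
nodes = pd.read_csv(csv)

# Pivot to one row per subject and one column per (metric, tract, node)
# combination: the wide feature matrix consumed by the linear model.
X = nodes.pivot_table(index="subjectID",
                      columns=["tractID", "nodeID"],
                      values=["fa", "md"])
print(X.shape)   # (2, 4): 2 subjects x (2 metrics * 1 tract * 2 nodes)
```

The resulting column MultiIndex (metric, tract, node) also encodes the bundle-metric group membership needed by the SGL penalty.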

Results and discussion

We developed a method for analyzing dMRI tractometry data with SGL. Classification results for individual subjects are shown in Figure 3, with "ground-truth" ALS status separated into columns and predicted ALS status encoded by color. In addition to this classification performance, SGL also identifies the white matter tracts most important for ALS classification. The relative importance of white matter features is captured in the β̂ coefficients from Eq (3). Figure 4 depicts these coefficients along the right CST, plotted over the FA values for the control and ALS subject groups. We find that SGL selects FA metrics in the corticospinal tract, and particularly in the right corticospinal tract, as most important to ALS classification, confirming previous findings [51-60] and identifying the portions of the brain that were selected a priori in the previous study from which we collected our data [3].

Analyzing the ways in which the model mislabels individuals may also provide insight. We found that mislabelled subjects are outliers relative to their group with respect to diffusion features of the CST (Figure 5).

To test the performance of SGL with tractometry data in a continuous regression task, we focus here on the prediction of biological age based on tractometry data. Prediction of "brain age" is a commonly undertaken task. This is both because it operates on a natural scale, with meaningful and easily understood units, and because predictions of brain age, and deviations from accurate prediction, are diagnostic of overall brain health (for a recent review, see [61]). The WH dataset used here contains data from 76 healthy subjects, ranging between 6 and 50 years of age [4]. In this case, biological age was used as the predicted variable (y in Eq (1)). SGL was fit using the nested cross-validation scheme described above: for each held-out batch (or fold), the model is fully fit without these data. Then, once the parameters are fixed, the model is inverted to predict the ages of the held-out subjects based on the linear coefficients and the static non-linearity. This scheme automatically finds the right level of regularization (i.e., sparseness) and fits the coefficients of the ill-posed linear model, while guarding against overfitting. SGL accurately predicts the age of the subjects in this procedure, with a mean absolute error of 3.6 years (Figure 6, left panel). This is lower than the results of a recent study that predicted age in a large sample based on diffusion MRI features [62]. Nevertheless, older subjects have higher residual variance, reflecting the automatically-chosen log-transformation and implying that predictions of brain age are less accurate for older subjects.

In both analyses, the method characterizes how informative features are distributed amongst brain regions and measured diffusion properties.
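The fit-in-log-space-then-invert scheme described above can be sketched with scikit-learn's target-transformation wrapper; a Lasso estimator stands in for SGL, and the subject count, feature count, and synthetic ages are toy assumptions.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_predict

# Toy stand-in: 76 "subjects" with ages between 6 and 50 driven by a
# single informative feature.
rng = np.random.default_rng(3)
X = rng.normal(size=(76, 40))
age = np.clip(28 + 10 * X[:, 0] + 2 * rng.normal(size=76), 6, 50)

# Fit the linear model in log-age space (the static non-linearity),
# then invert the transform to report predictions in years.
model = TransformedTargetRegressor(regressor=Lasso(alpha=0.05),
                                   func=np.log, inverse_func=np.exp)
pred = cross_val_predict(model, X, age, cv=5)   # out-of-fold predictions
mae = mean_absolute_error(age, pred)
```

Because the inverse transform is an exponential, predictions are always positive, and equal-sized errors in log space translate to larger errors in years at older ages, consistent with the higher residual variance noted above.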

Specifically, we demonstrated that the algorithm correctly detects the fact that ALS, which is a disease of the motor neurons, is localized to the corticospinal tract. This recapitulates the results of a previous analysis of these same data, using a targeted ROI-based approach [3]. In contrast, for the analysis of biological age, the coefficients identified by the algorithm are very widely distributed across many parts of the white matter, mirroring previous results with this dataset (and others) that show a large and continuous distribution of life-span changes in white matter properties [4].

Taken together, these results demonstrate the promise of the group-regularized regression approach. Even at the scale of dozens of subjects, the results provided by SGL are both accurate and interpretable [29]: tractometry capitalizes on domain knowledge to engineer meaningful features; SGL scores these features based on their relative importance, enables a visualization of these feature-importance scores in the anatomical coordinate frame of the bundles (e.g., Figure 3 and the corresponding figure for the regression coefficients), and provides a means to understand model errors (e.g., Figure 5). Thus, this multivariate analysis approach both (a) achieves high cross-validated accuracy for precision-medicine applications of dMRI data and (b) identifies relevant features of brain anatomy that can further our neuroscientific understanding of clinical disorders.

Neuroscience has entered an era in which consortium efforts are putting together large datasets of high-quality dMRI measurements to address a variety of scientific questions [42,63-66]. The volume and complexity of these data pose a substantial challenge. Dimensionality reduction with tractometry, followed by analysis with the approach we present here, promises to capitalize on the wealth of data and on the co-measurement of interesting and important phenotypical data about brain health and about the participants' cognitive abilities. We also expect the group-regularized approach to improve with larger datasets.

The results we present here also motivate extensions of the method using more sophisticated cost functions. For example, the fused sparse group lasso (FSGL) [69] extends SGL to enforce additional spatial structure: smoothness in the variation of diffusion metrics along the bundles. As brain measurements include additional structure (for example, bilateral symmetry), future work could also incorporate overlapping group membership for each entry in the tract profiles [70]. For example, a measurement could come from the corpus callosum, but also from the right hemisphere. This would also require extending the cost function used here to incorporate these constraints. Similarly, unsupervised dimensionality reduction of tractometry data (e.g., [71]) could also benefit from constraints based on grouping.

The method is packaged as open-source software called AFQ-Insight that is openly available and provides a clear API to allow for extensions of the method. The software integrates within a broader automated fiber quantification software ecosystem: AFQ [1], which extracts tractometry data from raw and processed dMRI datasets, as well as AFQ-Browser, which visualizes tractometry data and facilitates sharing of the results of dMRI studies [49]. To facilitate reproducibility and ease use of the software, the results presented in this paper are also provided in https://github.com/richford/