## Abstract

Neuroimaging allows for the non-invasive study of the brain in rich detail. Data-driven discovery of patterns of population variability in the brain has the potential to be extremely valuable for early disease diagnosis and understanding the brain. The resulting patterns can be used as imaging-derived phenotypes (IDPs), and may complement existing expert-curated IDPs. However, population datasets, comprising many different structural and functional imaging modalities from thousands of subjects, provide a computational challenge not previously addressed. Here, for the first time, a multimodal independent component analysis approach is presented that is scalable for data fusion of voxel-level neuroimaging data in the full UK Biobank (UKB) dataset, that will soon reach 100,000 imaged subjects. This new computational approach can estimate modes of population variability that enhance the ability to predict thousands of phenotypic and behavioural variables using data from UKB and the Human Connectome Project. A high-dimensional decomposition achieved improved predictive power compared with widely-used analysis strategies, single-modality decompositions and existing IDPs. In UKB data (14,503 subjects with 47 different data modalities), many interpretable associations with non-imaging phenotypes were identified, including multimodal spatial maps related to fluid intelligence, handedness and disease, in some cases where IDP-based approaches failed.

## 1 Introduction

Large-scale multimodal brain imaging has enormous potential for boosting epidemiological and neuroscientific studies, generating markers for early disease diagnosis and prediction of disease progression, and the understanding of human cognition, by means of linking to clinical or behavioural variables. Recent major studies have been acquiring brain magnetic resonance imaging (MRI), genetics and demographic/behavioural data from large cohorts. Examples are the UK Biobank (UKB)^{1}, the Human Connectome Project (HCP)^{2} and the Adolescent Brain Cognitive Development (ABCD) study^{3}. These studies involve multimodal data, meaning that several distinct types of MRI data are acquired, mapping activity, functional networks, structural connectivity, white matter microstructure, and organisation and volumes of different brain tissues and sub-structures^{1}. However, the multimodal, high-dimensional and noisy nature of such big datasets makes many existing analytical approaches for extracting interpretable information impractical^{4}.

Traditionally, large-scale neuroimaging studies first summarize the imaging data into interpretable image-derived phenotypes (IDPs)^{1,5}, which are scalar quantities derived from raw imaging data (e.g., regional volumes from structural MRI, mean task activations from task MRI, resting-state functional connectivities between brain parcels). This knowledge-based approach is simple and efficient, and effectively reduces the high-dimensional data into interpretable, compact, convenient features. However, there may well be a large loss of information, due to such “expert-hand-designed” features not capturing important sources of subject variability (or even just losing sensitivity by the pre-defined spatial sub-areas being suboptimal), as well as ignoring cross-modality relationships. Further, such uni-modal compartmentalised analyses do not utilise the fact that for many biological effects of interest we expect there to be biological convergence across different data modalities, i.e., changes in the underlying biological phenotype likely manifest themselves across multiple quantitative phenotypes, so that a joint analysis effectively increases both the power of detecting such effects and the interpretability of the findings.

In contrast to such uni-modal analyses, data-driven multivariate approaches (i.e., unsupervised machine learning) have been proposed, which perform simultaneous decomposition of voxel-level data directly, generally representing data as the summation of a number of “components” or “modes”. Each mode is formed as the outer product of two vectors: a vector of subject weights (describing the relative strength of expression of that mode in each subject), and a vector of voxel weights (in effect a spatial map for each data modality, describing the spatial localisation of the mode). The subject weight vectors (one per mode) can be considered “features” (similar to IDPs, but data-driven) for use in further modelling, such as the prediction of non-imaging variables. These approaches are often either based on eigen-decomposition, such as multi-set canonical correlation analysis (mCCA)^{6,7}, or on variations of independent component analysis (ICA)^{8–11}. Among them, FMRIB’s Linked ICA (FLICA)^{11} is an efficient approach which has been successfully applied to identify brain systems involved in lifespan development and diseases^{12,13}, attention deficit hyperactivity disorder^{14}, preterm brain development^{15}, and cognition and psychopathology^{16}. FLICA has advantages over uni-modal analysis of IDPs: (1) it leverages the cross-modality information of multimodal data, so has the potential to detect patterns that are not discoverable in any single modality; (2) it is a data-driven, objective approach which automatically discovers meaningful patterns in voxel-level multimodal data by searching for spatially non-Gaussian sources, which have been shown to likely reflect real structured features in neuroimaging data^{17}.
While this approach has been applied successfully to medium-sized cohort data^{12–16}, the original algorithms for carrying out FLICA do not scale well with increasing data size, and are unable to analyze large datasets such as UKB, where dozens of different modalities over tens of thousands of subjects are available. Importantly, because the core FLICA algorithms are multivariate, acting in a complex way simultaneously across all subjects, modalities and voxels using Variational Bayesian updates of parameters, this problem cannot be solved through simple parallelisation or other algorithmically simple methods for distributing computations across a large cluster, and so cannot be addressed simply by increasing the number of processors or memory available.

To tackle this problem, we propose an approach that embeds advanced data compression techniques, applied across the different data dimensions, into the FLICA approach. We use a multimodal extension of MELODIC’s Incremental Group Principal component analysis^{18} (mMIGP, applied across modalities) and online dictionary learning^{19} (DicL, applied within modalities) to efficiently reduce the size of multimodal neuroimaging data. The reduced data are then characterised through FLICA in terms of underlying modality-specific maps and subject loading vectors. Here we refer to this combination of techniques as Big-data FLICA, or BigFLICA for short. Two important advantages of the proposed approach are: (1) it preserves key information in the original data while reducing the effects of stochastic domain-specific noise; (2) it increases the computational efficiency of the FLICA algorithm for extremely large population datasets. BigFLICA is scalable for simultaneously analyzing all the multimodal data of the full 100,000-subject UKB dataset using only a modest computing cluster (**Fig. 1**).

We first demonstrate the effectiveness of our approach through extensive simulations. Then, in real data, we quantify performance in terms of the prediction accuracy of non-imaging-derived phenotypes (nIDPs)^{20,21}, such as health outcome measures. Using voxel-level imaging data of 81 modalities from 1,003 subjects in the HCP and 47 modalities from 14,053 subjects in the UKB, we show that BigFLICA performs comparably with the original FLICA^{11} in terms of prediction accuracy for nIDPs (158 in HCP and 8,787 in UKB). Most importantly, we systematically investigated whether there are benefits to jointly fusing multimodal data, instead of analysing each modality separately. We show significant improvements in the prediction accuracy of nIDPs when comparing a high-dimensional BigFLICA with other widely-used data analysis strategies: (1) running single-modality ICA and concatenating the results across modalities, and (2) using existing IDPs (5,812 in HCP and 3,913 in UKB). In particular, the improvements in prediction of many health outcome and cognitive variables are large, more than doubling prediction accuracy for some variables. Furthermore, we investigated the relationship between modes derived by BigFLICA and IDPs. We found that, although the modes were estimated from the same set of voxel-level data, they contain complementary information which can be combined to further increase the prediction accuracy of nIDPs. Finally, we applied BigFLICA to the UKB data and extracted 750 components; existing multimodal ICA approaches cannot estimate this many modes from this many subjects. We found several interpretable associations between BigFLICA modes and nIDPs, including modes that relate to *fluid intelligence, handedness, age started wearing glasses or contact lenses* and *hypertension*. In many cases BigFLICA finds associations with nIDPs with greater statistical sensitivity than was possible with IDPs.
Overall, BigFLICA demonstrated the advantages of data-driven joint multimodal modelling in the analysis of biobank-scale multimodal datasets.

## 2 Results

### Brief overview of the proposed approach: BigFLICA

FLICA^{11} is a Bayesian ICA approach for multimodal data fusion. The input of FLICA is *K* modalities’ data matrices *Y*^{(k)} with dimensions *N* × *P_k*, *k* = 1,…,*K*, where *P_k* is the number of features (e.g., voxels) and *N* is the number of subjects. FLICA aims to find a joint *L*-dimensional decomposition of all *Y*^{(k)}: *Y*^{(k)} = *H* *W*^{(k)} *X*^{(k)} + *E*^{(k)}, where *H*_{(N×L)} is the shared subject mode (mixing matrix) across modalities (a vector of subject weights for each mode), so is a ‘link’ across different modalities; *W*^{(k)} is a positive diagonal mode-weights matrix (one overall weight per modality per mode); *X*^{(k)} is the independent (spatial) feature maps for the *L* components of a modality (one map per modality per mode); and *E*^{(k)} is the modality-specific Gaussian noise term (**Fig. 1**). We propose two efficient approaches that can either be used separately or combined together to reduce the size of the original data matrices, and therefore reduce the computational load of the original FLICA. An overview of BigFLICA is shown in **Fig. 1**.
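The generative structure above can be made concrete with a small simulation. The sketch below is illustrative only: the function name, chosen dimensions and distributions are our own assumptions, and the real FLICA places Bayesian priors on every term and infers them with variational updates rather than sampling them directly.

```python
import numpy as np

def simulate_flica_data(N=100, L=5, Ps=(200, 300), noise=0.1, seed=0):
    """Generate toy multimodal data following the FLICA generative form
    Y_k = H @ W_k @ X_k + E_k (illustrative sketch, not the inference model)."""
    rng = np.random.default_rng(seed)
    H = rng.standard_normal((N, L))               # shared subject weights (the 'link')
    Ys, Ws, Xs = [], [], []
    for P in Ps:
        W = np.diag(rng.uniform(0.5, 2.0, L))     # positive diagonal mode weights
        X = rng.laplace(size=(L, P))              # non-Gaussian spatial maps
        E = noise * rng.standard_normal((N, P))   # modality-specific Gaussian noise
        Ys.append(H @ W @ X + E)
        Ws.append(W)
        Xs.append(X)
    return Ys, H, Ws, Xs
```

Because every modality shares the same subject matrix *H*, each simulated *Y*^{(k)} is (up to noise) rank-*L*, which is the structure the decomposition exploits.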

The first approach, termed multimodal extension of MELODIC’s Incremental Group Principal component analysis^{18} (mMIGP), aims to reduce the subject dimension to a linear combination of the original subjects. mMIGP is a time- and memory-efficient approximation of principal component analysis (PCA) on feature-concatenated multimodal data. To this end, if we aim to obtain an *L*^{★}-dimensional decomposition, we first apply MIGP^{18} separately within each modality to estimate *U*^{(k)}_{(N×L★)}, which is an approximation of an *L*^{★}-dimensional PCA decomposition of one modality *Y*^{(k)}. This step can be done in parallel across modalities. Then, we concatenate all *U*^{(k)} in the component dimension and apply another MIGP to get *U*_{(N×L★)}, which is an *L*^{★}-dimensional approximate PCA decomposition of all modalities together. Finally, we project the original data of each modality *Y*^{(k)} onto the PCA-reduced space using *U*. If no further reduction (e.g., dictionary learning as below) is to be applied, the data fed into the core FLICA would be the *K* component-by-feature matrices *V*^{(k)} of size *L*^{★} × *P_k*, and FLICA would then extract *L* (*L* < *L*^{★}) components from these (**Methods**). This step adds almost no computational cost compared with the original FLICA, because a similar PCA step is needed to initialize the parameters of the original FLICA, but this approach is feasible for large numbers of subjects and modalities. Although different modalities usually have different overall signal-to-noise ratios (SNR), which is largely ignored by this mMIGP step, the subsequent FLICA can take this into account via the modality-specific noise terms, and a high-dimensional mMIGP is used to capture modes with even small variations in each modality.

It is known that voxels are correlated both in a local fashion (local spatial autocorrelation) and across brain networks (long-range correlation); hence, effective feature subsampling can hope to capture all important information in the data while reducing the cost of spatial modelling in FLICA^{22}. We therefore incorporate an approach, termed sparse online Dictionary Learning^{19} (DicL), to reduce the dimension of the feature (e.g., voxel) space in a way that captures both local and distant spatial correlation structure. Specifically, for each modality, we use DicL to model *V*^{(k)} as a sparse linear combination of *L*^{★★} basis elements: *V*^{(k)} = *A*^{(k)}*D*^{(k)}, where *D*^{(k)} is the *sparse spatial dictionary basis* and *A*^{(k)} is the *feature loadings*. By minimizing the reconstruction error while enforcing sparsity in the dictionary basis *D*^{(k)}, we aim to achieve an optimal subsampling of the feature space. The inputs to FLICA are then *K* smaller matrices *A*^{(k)}, which are only of dimension *L*^{★} × *L*^{★★}, and FLICA then extracts *L* (*L* < min(*L*^{★}, *L*^{★★})) components from these (**Methods**). Compared with running FLICA on the original *K* large *N* × *P_k* matrices, using the DicL-preprocessed data greatly reduces the computational load of FLICA. DicL can easily be parallelized across modalities and is memory friendly, which further increases efficiency (**Fig. 1**).
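A toy batch version of this idea can be written as alternating least-squares updates of the loadings *A* with proximal-gradient (soft-thresholding) steps on the dictionary *D*, so that the ℓ1 penalty makes *D* spatially sparse. This is not the online algorithm of ref. 19 (which streams data in mini-batches), and all names and parameter choices below are our own:

```python
import numpy as np

def soft_threshold(X, t):
    # Elementwise proximal operator of t * ||.||_1
    return np.sign(X) * np.maximum(np.abs(X) - t, 0.0)

def dicl(V, L_dd, lam=1.0, n_iter=50, seed=0):
    """Toy batch dictionary learning: V (L* x P) ~ A (L* x L_dd) @ D (L_dd x P),
    with an l1 penalty on D so the spatial basis is sparse. Alternates exact
    least-squares updates of A with ISTA (proximal gradient) steps on D."""
    rng = np.random.default_rng(seed)
    L_star, P = V.shape
    D = rng.standard_normal((L_dd, P))
    A = None
    for _ in range(n_iter):
        # A-step: least squares given D (small ridge for numerical stability),
        # then rescale columns of A to unit norm to fix the scale ambiguity
        A = V @ D.T @ np.linalg.pinv(D @ D.T + 1e-6 * np.eye(L_dd))
        A /= np.maximum(np.linalg.norm(A, axis=0, keepdims=True), 1e-12)
        # D-step: a few ISTA steps on 0.5*||V - A D||_F^2 + lam*||D||_1
        lip = np.linalg.norm(A.T @ A, 2) + 1e-12  # Lipschitz constant of the gradient
        for _ in range(5):
            D = soft_threshold(D - (A.T @ (A @ D - V)) / lip, lam / lip)
    return A, D
```

The soft-thresholding step produces exact zeros in *D*, mirroring the sparse spatial dictionary described above; the small `A` matrices are what would be passed on to FLICA.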

### Evaluation of BigFLICA in simulations

We first applied BigFLICA on simulated data to evaluate the performance of mMIGP and DicL as data preprocessing approaches under different parameter settings and data signal-to-noise ratios. The mean correlation of extracted components with simulated ground truth was compared with the corresponding result from the original FLICA (**Methods**).

For mMIGP, **Fig. 2a** shows that, in most situations, BigFLICA with mMIGP preprocessing gave similar results to the original FLICA, and both accurately recovered the underlying ground truth in most cases. This is in agreement with the simulation results in the MIGP paper^{18}, which showed that MIGP can accurately approximate a full-data PCA in different situations. The optimal dimension of mMIGP differs among simulations; sometimes a relatively low dimension achieves an accurate estimation of components (e.g., **Fig. 2a**, first three columns), while in other cases a high dimension is needed (e.g., **Fig. 2a**, fourth column).

For DicL, **Fig. 2b** shows that in almost all circumstances: (1) increasing the dictionary dimension boosts the performance of the subsequent FLICA analysis; (2) the optimal sparsity parameters usually lie between *λ* = 0.5 and 2, with similar performance across this range; (3) in most cases the optimal performance given by DicL matches that of the non-reduced analysis (noted in figure legends). Therefore, in the real data analysis, when using the DicL approach, we always use a very high-dimensional DicL decomposition and fix the sparsity parameter to *λ* = 1.

### Computation time comparison

**Table 1** shows the comparison of computation time and memory requirements of BigFLICA and the original FLICA in the UKB dataset. All code was implemented in Python 2.7, and both BigFLICA and FLICA were run using 24 cores on a single compute node with Intel Xeon E7-8857 v2 @ 3.00GHz CPUs and 2048 GB RAM. The computation time includes: (1) preprocessing of data using mMIGP and DicL (BigFLICA only); (2) initialization of FLICA parameters; (3) FLICA VB parameter updates. For the 100,000-subject data, BigFLICA decreases the computation time and memory usage from an unrealistic amount to a modest configuration for a modern HPC cluster, thereby opening up the possibility of data-driven population phenotype discovery.

### Real data: Comparing BigFLICA with the original FLICA based on the prediction accuracy of nIDPs

As there is no ground truth available, we tested whether modes of BigFLICA have similar prediction accuracy for nIDPs compared with the original FLICA, using data from the HCP and a subset of 1,036 subjects from the UKB. Elastic-net regression with nested 5-fold cross-validation was used to predict each of the nIDPs. This approach is widely used and has been shown to achieve robust, state-of-the-art performance in many neuroimaging studies^{24,25}. The Pearson correlation between each predicted and true nIDP in the outer test fold is used to quantify accuracy. The statistical significance of differences in prediction accuracy between two approaches is estimated by a weighted paired t-test (**Methods**).
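A minimal sketch of this nested cross-validation scheme, assuming scikit-learn and an illustrative hyper-parameter grid (the actual grids, solver settings and weighting used in the paper are not specified here):

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, KFold

def predict_nidp(X, y, seed=0):
    """Nested 5-fold CV elastic-net prediction of one nIDP; returns the
    Pearson correlation between out-of-fold predictions and the truth."""
    outer = KFold(n_splits=5, shuffle=True, random_state=seed)
    y_hat = np.zeros(len(y))
    for train, test in outer.split(X):
        # Inner 5-fold CV selects the penalty strength and l1/l2 mix
        inner = GridSearchCV(
            ElasticNet(max_iter=10000),
            {"alpha": [0.01, 0.1, 1.0], "l1_ratio": [0.1, 0.5, 0.9]},
            cv=5)
        inner.fit(X[train], y[train])
        y_hat[test] = inner.predict(X[test])
    return np.corrcoef(y, y_hat)[0, 1]
```

Because hyper-parameters are chosen only on inner-fold training data, the outer-fold correlation is an unbiased estimate of out-of-sample prediction accuracy.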

**Fig. 3** shows Bland-Altman plots comparing the prediction accuracy of nIDPs between the original FLICA and BigFLICA with mMIGP preprocessing only (**Fig. 3a**), with DicL preprocessing only (**Fig. 3b**), and with both data reduction approaches (**Fig. 3c**), in the UKB and HCP datasets. In these comparisons, mMIGP reduced the data to approximately 1/10 to 1/2 of the original size, and DicL to approximately 1/75. Overall, BigFLICA estimates similar sets of modes with comparable prediction accuracy in real multimodal neuroimaging data, i.e., the difference in correlation between the two methods is centred around zero across a wide range of mean correlation values (also reflected in the insignificant p-values of the weighted paired t-tests), demonstrating that the mMIGP and DicL approaches effectively reduce the data while preserving its key information.

### Comparing BigFLICA with multiple independent single-modality ICA decomposition

We also compared BigFLICA outputs against features pooled from separate ICA processing of each modality. **Fig. 4a** shows that BigFLICA has worse prediction performance than running ICA separately on each modality when the dimensionality *L* is low. This is because, at low decomposition dimensions, single-modality ICA is more efficient: the constraints on the degrees-of-freedom implied by the FLICA model leave it unable to capture the important data variation in joint components. However, when *L* becomes large, the prediction accuracy becomes better than that of single-modality ICA (e.g., *L* ⩾ 250 in UKB). This is because, at high decomposition dimensions, BigFLICA effectively combines multimodal information by considering cross-modal correlation at the data decomposition stage. Although cross-modal correlation is considered in the final prediction stage when using single-modality ICA, the fact that BigFLICA identifies and takes advantage of correlated information between modalities at an earlier stage, in feature generation, helps improve prediction performance.

### Comparing BigFLICA with hand-curated imaging-derived phenotypes

We compared the predictive performance of BigFLICA with IDPs in both the HCP and UKB datasets (**Methods**). **Fig. 4b** shows that, in the UKB data, when the number of modes is low, BigFLICA has worse predictive power than the joint performance of 3,913 IDPs, for the same insufficient degrees-of-freedom reason as above. However, when the dimensionality becomes higher, BigFLICA clearly outperforms the IDPs, owing to jointly fusing multimodal voxelwise data by considering cross-modality correlation. In the HCP data, the performance is overall similar. These results indicate that BigFLICA can potentially explain more phenotypic and behavioural variance than IDPs.

In more detail, **Table S1** shows that, in the UKB dataset, the high-dimensional BigFLICA (nIC=750) has improved prediction accuracy for many nIDPs relating to *cognition phenotypes* and *health outcomes* compared with IDPs. These tables do not include nIDPs for which both methods have low predictive power (*r* < 0.1). In the HCP dataset (**Table S2**), BigFLICA (nIC=100) also shows improved prediction accuracy for many cognitive and health outcome variables compared with IDPs.

Further, when we concatenated the modes of BigFLICA and the IDPs together to predict nIDPs, as shown in **Fig. 4c**, the combined feature sets gave a significant improvement in prediction accuracy over the IDPs alone in the UKB data. There was almost no difference for the same comparison in the HCP data. This suggests that BigFLICA and IDPs may carry complementary information about nIDPs.

To investigate the relationships between BigFLICA and IDPs further, we built prediction models that used modes of BigFLICA to predict each of the IDPs, to further characterise the information overlap and complementarity between the two approaches. As shown in **Figs. S2a** and **S2b**, different types of IDPs are predicted with different accuracy, and the resting-state functional connectivities always had the worst accuracy in both the HCP and UKB datasets, because they are (relatively) noisy. However, when using BigFLICA modes to predict 6 new summary features of the connectivity matrices (derived by applying ICA to the matrix of subjects by network-matrix edges)^{5}, the accuracy is very high (*r* ranging from 0.85 to 0.89 for a 100-dimensional BigFLICA decomposition). In addition, when we used IDPs to predict modes of BigFLICA, as shown in **Figs. S2c** and **S2d**, the prediction correlations showed an almost bimodal distribution, meaning that some of the FLICA modes can be predicted by the IDPs (mean *r* ≈ 0.8) while others cannot (mean *r* ≈ 0.2). These results further demonstrate that BigFLICA and IDPs span significant complementary variance.

### BigFLICA comparison with mCCA and reproducibility

We next compared BigFLICA against mCCA (eigendecomposition-based modelling, which of course would also require advances similar to BigFLICA's in order to work on large data; see online Methods). Overall, BigFLICA had (very slightly) improved prediction accuracy (**Fig. S3**), with slightly more parsimonious modelling (**Fig. S4**). However, in split-half (across subjects) reproducibility testing, BigFLICA components were considerably more reproducible than those from mCCA (median BigFLICA correlation greater than 0.9 in all cases, while many mCCA dimensionalities have median correlation less than 0.5) (**Fig. S5**).
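Split-half reproducibility of this kind is typically summarised by matching components across the two halves by spatial correlation and reporting the matched correlations. A simple greedy matcher (our own sketch, not necessarily the exact procedure used in the paper) can be written as:

```python
import numpy as np

def match_components(X1, X2):
    """Greedily pair two component sets (each L x P spatial maps) by absolute
    Pearson correlation; returns the L matched correlations, whose median can
    serve as a split-half reproducibility summary."""
    L = len(X1)
    # Cross-correlation block between the two sets (rows of X1 vs rows of X2);
    # absolute value since component signs are arbitrary in ICA-like models
    C = np.abs(np.corrcoef(X1, X2)[:L, L:])
    matched = []
    for _ in range(L):
        i, j = np.unravel_index(np.argmax(C), C.shape)  # best remaining pair
        matched.append(C[i, j])
        C[i, :] = -1.0   # remove this pair from further consideration
        C[:, j] = -1.0
    return np.array(matched)
```

A Hungarian (optimal assignment) matcher would give the globally best pairing, but greedy matching is usually adequate when components are well separated.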

### Examples of BigFLICA modes in the 14k UKB dataset

We now give four examples of significant associations between BigFLICA modes and nIDPs, namely *Fluid intelligence*, *Age started wearing glasses or contact lenses*, *Handedness* and *Hypertension*. In **Fig. 5**, we show the four most strongly associated modalities in FLICA modes that correlate with a given nIDP. **Fig. S6** shows the population cross-subject mean maps for several task and resting-state fMRI modalities fed into FLICA. This gives interpretive context for the FLICA mode maps, which depict subject variability in activity/connectivity relative to these group mean maps.

For *Fluid intelligence*, using all modes (ICs) from the 750-dimensional BigFLICA decomposition as features (predictors) in multivariate elastic-net prediction, a cross-validated prediction correlation of *r* = 0.26 is achieved. When we correlated each of the BigFLICA modes and IDPs with the fluid intelligence score in the UKB, we found that several task-fMRI-related BigFLICA modes have the strongest associations (**Fig. 5a**). The first (IC 25) involves the task contrasts “faces” and “faces>shapes”, and the second (IC 57) involves the contrasts “shapes” and “faces” (see **Table S5** for the full list of these modalities). As the correlation of mode IC 25 (i.e., its subject-weights vector) with fluid intelligence is negative (r = -0.14), the negative-weight voxels (such as in the anterior insula) are positively correlated with intelligence. The fMRI task (Hariri faces-shapes matching^{29}) has, as expected, the greatest population-average activation in sensory-motor areas (plus some amygdala involvement due to the emotionally negative nature of the faces), as seen in **Fig. S6**. However, the main brain areas involved in these modes are distinct, including anterior cingulate cortex, frontal pole, inferior frontal gyrus and anterior insula; it is therefore interesting that the areas found by BigFLICA to be modulated in these components (and found to associate with intelligence) are more “frontal, cognitive” areas than the sensory-motor areas primarily activated on average. The top associations between fluid intelligence and IDPs also involve task-fMRI IDPs (**Table S3**), but these were a factor of two weaker than the associations with BigFLICA modes.

For *Age started wearing glasses or contact lenses*, BigFLICA achieved a prediction correlation of *r* = 0.16. Several resting-state connectivity and task modalities showed associations in primary visual areas (**Fig. 5b**), consistent with the fact that this is a vision-related health variable. A lower age of first wearing glasses is correlated with stronger activity in primary visual areas, and also with the strength of resting-fMRI connectivity (or functional coherence) within the relevant areas of group-average connectivity; interestingly, in nearby, distinct (but still primary visual) areas there is a reduction of correlation (blue voxels), suggesting greater differentiation of primary visual areas.

For *Handedness*, BigFLICA achieved a prediction correlation of *r* = 0.23. BigFLICA identified several multimodal, lateralized (or laterally asymmetric) modes, including resting-state mode 14 (left-lateralised language network), task, surface area and white matter tracts (**Fig. 5c**). Several resting-state connectivity-related IDPs correlate with handedness (**Table S3**), consistent with a recent study^{30} that also used UKB IDPs, while no IDPs from other modalities were found to be significant; in both cases the maximum IDP correlation only reached r = 0.12, whereas the strongest association with BigFLICA modes was almost double this.

For the health variable *hypertension* (**Fig. 5d**), BigFLICA achieved a prediction correlation of *r* = 0.22. Several TBSS-related modalities showed consistent associations in the external capsule tracts. Meanwhile, white matter hyperintensity (T2-lesion volume) in the corresponding areas is also higher in people with hypertension. Several consistent findings have been reported in the literature^{31–33}.

## 3 Discussion

In this paper, we presented BigFLICA, a multimodal data fusion approach which is scalable and tuneable to analyze the full UK-Biobank neuroimaging dataset, and other large-scale multimodal imaging studies. To the best of our knowledge, this is the first approach for data-driven (unsupervised) multimodal analysis of a brain imaging dataset of this size and complexity. Building on top of the powerful FLICA model, we proposed a two-stage dimension-reduction approach that combines an incremental group-PCA (mMIGP) with dictionary learning (DicL) to effectively preprocess the multimodal dataset and reduce the computational load of the final FLICA, while maintaining or even improving performance, with as much as a 150-fold “intelligent” reduction in data size. We provide effective ways of choosing the hyper-parameters of BigFLICA, so that it is free of tuning except for choosing the final number of estimated components. Although this approach is motivated by the need to analyze extremely big neuroimaging data, it is also applicable to other kinds of data, such as genetics and behavioural measures. An easy-to-use version of this software will be integrated into an upcoming version of the FSL software package^{34,35}. BigFLICA results on UKB will also be released via the UKB database as new data-driven IDPs (image features), for further epidemiological and neuroscientific research.

A strength of our work is that, unlike previous work limited to more moderate datasets and a few phenotypic and behavioural variables^{8–11,36}, we used two of the largest high-quality multimodal neuroimaging datasets, and thousands of phenotypic and behavioural variables, to validate the proposed approach. We demonstrated that BigFLICA is not only much faster than the original FLICA (and can be run on very large data that is simply not analysable with FLICA or other existing methods), but also estimates similar modes with comparable performance for predicting non-imaging-derived phenotypes in real data (when tested on a large data subset that is just small enough to allow comparison against FLICA). We provide insights into the advantages of data-driven multimodal fusion in big datasets through quantitative analysis^{37,38}. First, when comparing BigFLICA with simpler IDP-based approaches (and also single-modality ICA approaches), we demonstrated that a high-dimensional BigFLICA has improved predictive power overall, demonstrating the value of multimodal fusion over analyzing each modality separately. Second, when combining high-dimensional BigFLICA-derived features with IDPs, the predictive power increased further compared with using either method alone. In addition, when we used BigFLICA-derived features and manually created (with expert knowledge) IDPs to predict each other, neither could predict the other perfectly (although both are derived from the same imaging data). This indicates that BigFLICA-derived features and IDPs can be complementary to each other, both therefore providing potentially important imaging biomarkers that capture different signal in the imaging data.
An interesting finding is that although a high-dimensional BigFLICA has much higher predictive power than a low-dimensional decomposition, a low-dimensional decomposition can still explain more than 80% of the total variance of the high-dimensional decomposition. This suggests that some of the phenotypic and behavioural variables are explained by only small proportions of the variance of the imaging data. Finally, in addition to the value of using BigFLICA-derived features for relating imaging to non-imaging data, BigFLICA components (particularly at lower dimensionalities) may allow us to learn more about how the different brain imaging modalities (and hence different spatial and biological aspects of the brain’s structure and function) relate to each other.

We see opportunities to improve the current approach. First, BigFLICA is limited to linear feature estimation, while the “ideal, true” information in imaging data may be highly nonlinear. Therefore, a nonlinear extension of BigFLICA, which might be achieved with kernel methods or deep neural networks, is an important area of further research. Second, BigFLICA is an unsupervised dimension reduction and feature generation approach. However, integrating some supervision, i.e., the target variable (such as disease outcomes), into the dimension reduction may boost the performance of the algorithm. Additionally, because BigFLICA generates data-driven features, as opposed to expert-created IDPs, the biological or anatomical interpretation of features is often likely not to be immediately obvious, requiring potentially intensive expert study. Future work could attempt to automate this interpretation process, for example by relating features to existing anatomical templates and atlases, and even by mining imaging literature. Finally, BigFLICA, or extensions, may be an effective way of discovering imaging confound factors^{39} that cannot be found by traditional approaches.

## Author contributions

W.G., C.B. and S.S. proposed the study, formulated the model, analyzed the data and wrote the manuscript.

## Competing interests

The authors declare that they have no competing financial interests.

## Data availability

BigFLICA will be released in an upcoming version of FSL. BigFLICA-derived features will be available from the UK Biobank database. For UK Biobank, all source data (including raw and processed brain imaging data, derived IDPs, and non-imaging measures) are available from UK Biobank via their standard data access procedure (see http://www.ukbiobank.ac.uk/register-apply). For HCP, data can be downloaded via the website (http://humanconnectome.org/data) and ConnectomeDB. Matlab software for performing prediction using elastic-net regression is available at https://github.com/vidaurre/NetsPredict.

## Methods

### FLICA model

The input to FLICA is *K* modalities’ data matrices *Y*^{(k)}, with each modality’s dimensions being *N* × *P*_{k}, *k* = 1,…,*K*, where *P*_{k} is the number of features (e.g., voxels) in modality *k* and *N* is the number of subjects. FLICA aims to find a joint *L*-dimensional decomposition of all *Y*^{(k)}:

*Y*^{(k)} = *H* *W*^{(k)} *X*^{(k)} + *E*^{(k)},

where *H*_{(N×L)} is the shared subject mode (mixing matrix) across modalities, and so acts as a ‘link’ across the different modalities; *W*^{(k)}_{(L×L)} is a positive diagonal mode-weights matrix; *X*^{(k)}_{(L×P_k)} contains the independent (spatial) feature maps for the *L* components of modality *k*; and *E*^{(k)} is the Gaussian noise term.

### Multimodal extension of MELODIC’s Incremental Group Principal component analysis for subject-space dimension reduction

We propose a multimodal extension of our previous MIGP approach^{18}, termed mMIGP, to reduce the subject dimension of multimodal data. MIGP has been extensively validated in simulations and in real neuroimaging data for finding an approximate PCA decomposition in a time- and memory-efficient way^{18}. Suppose that our multimodal data are *K* matrices *Y*^{(k)}, *k* = 1,…,*K*, with dimensions *N* × *P*_{k}, where *N* is the number of subjects and *P*_{k} is the number of features (e.g., voxels) in a modality. In mMIGP, each feature is first z-score normalized. Then, MIGP is applied to each modality separately to find an *L*^{★}-dimensional approximate PCA decomposition. Specifically, we want to find an approximation of a singular value decomposition (SVD) of each *Y*^{(k)}:

*Y*^{(k)} = *U*^{(k)} *S*^{(k)} *V*^{(k)⊤},

where *U*^{(k)} and *V*^{(k)} are the left and right singular vectors, and *S*^{(k)} is the diagonal matrix of singular values. A naive SVD of *Y*^{(k)} scales quadratically with *N*, which is not efficient when *N* is large. To find the approximation, MIGP sequentially feeds subsets of (columns of) *Y*^{(k)} into an SVD, so that each subset is reduced to a low-dimensional representation. This low-dimensional representation is then concatenated with the next subset of (columns of) *Y*^{(k)} and fed into another SVD to find a low-dimensional representation of their combination. The final SVD approximation is found after one pass through all the data. The computational complexity of MIGP scales linearly with *N*. For a detailed description, please see Appendix A of the MIGP paper^{18}.
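The one-pass reduction described above can be sketched as follows. This is a simplified illustration of the MIGP idea only; the function name, chunk size and toy dimensions are ours, not from the paper:

```python
import numpy as np

def migp(Y, L_star, chunk=200):
    """One-pass approximate PCA: sequentially SVD chunks of rows of Y
    (N x P), keeping only an L_star-dimensional running representation,
    so memory stays O(L_star x P) rather than O(N x P)."""
    W = None  # running low-dimensional representation
    for start in range(0, Y.shape[0], chunk):
        block = Y[start:start + chunk]
        stacked = block if W is None else np.vstack([W, block])
        # reduce the stacked matrix back to L_star rows via a small SVD
        U, s, Vt = np.linalg.svd(stacked, full_matrices=False)
        W = s[:L_star, None] * Vt[:L_star]
    return W  # L_star x P: approximate top principal components of Y

rng = np.random.default_rng(0)
Y = rng.standard_normal((1000, 40)) @ rng.standard_normal((40, 300))  # rank 40
W = migp(Y, 40)
```

When the data rank is no greater than `L_star` (as in this toy example), the returned rows span the row space of `Y` exactly; in general they approximate the top principal components after a single pass through the data.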

The third step is to concatenate all of the modality-wise MIGP-reduced representations in the component dimension and apply another MIGP to find an *L*^{★}-dimensional approximate PCA decomposition *U* of size *N* × *L*^{★}, which is a low-dimensional representation of the multimodal data in the analysis. Finally, the z-score normalized data *Y*^{(k)} of each modality are projected onto *U* by:

*V*^{(k)} = *U*^{⊤} *Y*^{(k)},

and the *V*^{(k)}, *k* = 1,2,…,*K*, are the inputs of the subsequent FLICA algorithm. Therefore, the total size of the data output by this stage is *L*^{★} × Σ_{k}*P*_{k}, which is smaller than the original input size *N* × Σ_{k}*P*_{k}. The fractional reduced data size is *L*^{★}/*N*, and *L*^{★} can be kept fixed as more subjects are introduced, so the approach is scalable for big-data analysis. In practice, we usually choose *L*^{★} based on the percentage of variance explained by the SVD in the third step.

If we feed *V*^{(k)}, *k* = 1,2,…,*K*, into FLICA to estimate *L* FLICA modes, the output subject-mode matrix *H*^{★} is of size *L*^{★} × *L*, so we then simply multiply it by *U* to get the final subject-mode matrix:

*H* = *U* *H*^{★}.

The mMIGP approach is equivalent to performing an approximate PCA on feature-concatenated data. The advantage is that it does not need to fit all of the data into memory, and it can even be parallelized across modalities^{18}. This approach is also equivalent to applying mCCA across all modalities^{40}.

### Sparse dictionary learning for voxel-space dimension reduction

If the resolution of the data is high and the number of modalities is large, applying the mMIGP reduction alone still leaves FLICA memory- and computationally expensive. Therefore, we propose a method that can effectively reduce the voxel dimension while preserving the important spatial information for the subsequent FLICA spatial modelling. Although the most obvious approaches to voxel subsampling are either regular spatial down-sampling (or, similarly, local voxel clustering) or PCA *within each modality*, the former only focuses on local patterns^{22} (and does not adapt the downsampling to local variations in redundant information across voxels), while the latter tends to find more global and noisy patterns in neuroimaging data, and empirically does not work at all well in this context (see also Allen, et al.^{23} and references therein).

The method we used here is sparse Dictionary Learning (DicL)^{19}, which effectively performs ‘voxel grouping’ in both a local and a global fashion. It can be applied directly to each of the original z-score normalized modalities, i.e., *Y*^{(k)}, *k* = 1,2,…,*K*, or to the mMIGP-reduced data, i.e., *V*^{(k)}, *k* = 1,2,…,*K*. Taking the former as an example, the sparse DicL factorizes each modality as:

*Y*^{(k)} ≈ *A*^{(k)} *D*^{(k)},

where *D*^{(k)} is the *sparse spatial dictionary basis*, and *A*^{(k)} is the matrix of *feature loadings*, with each column representing a linear combination of information from a group of voxels, which might be either a local cluster or a spatially distributed network. By minimizing an *l*_{1}-regularized sparse-coding objective function, a locally optimal solution can be obtained:

min_{*A*^{(k)}, *D*^{(k)}} Σ_{i} ( ||*Y*_{i}^{(k)} − *A*^{(k)} *D*_{i}^{(k)}||_{2}^{2} + *λ*||*D*_{i}^{(k)}||_{1} ),

where subscript *i* represents the *i*-th column of the corresponding matrix, and *λ* is a regularization parameter. The *l*_{1}-regularization term enforces that the learned spatial loadings *D*^{(k)} are sparse. The objective function can be efficiently optimized by a block-coordinate descent optimizer with warm restarts, as implemented in the SPAMS package (http://spams-devel.gforge.inria.fr/). Compared with simply using PCA in this step, sparse DicL has three advantages: (1) the spatial loading matrix *D*^{(k)} can be sparse, so a smaller number of voxels is involved in each column of the dictionary; (2) the columns of the dictionary do not need to be orthogonal to each other, which is more flexible; (3) an “overcomplete” dictionary is allowed, i.e., the number of dictionary basis vectors can exceed the minimum of *N* and *P*_{k}, which further increases the flexibility.
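As an illustration of this kind of factorization, scikit-learn’s `MiniBatchDictionaryLearning` (a stand-in for SPAMS, which is what the paper actually uses) can be fit on the transposed data so that the *l*_{1} penalty falls on the spatial side; all dimensions below are toy values of our choosing:

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)
N, P, L2 = 100, 500, 20            # subjects, voxels, dictionary size (toy)
Y = rng.standard_normal((N, P))    # z-score normalized modality data (stand-in)

# Fitting on Y.T puts the sparsity penalty on the spatial factor:
#   Y.T (P x N)  ~  codes (P x L2)  @  dictionary (L2 x N)
dl = MiniBatchDictionaryLearning(n_components=L2, alpha=1.0,
                                 transform_algorithm="lasso_lars",
                                 transform_alpha=1.0, random_state=0)
D = dl.fit_transform(Y.T).T        # sparse spatial loadings D, L2 x P
A = dl.components_.T               # feature loadings A fed to FLICA, N x L2
```

Here `A` plays the role of *A*^{(k)} (the matrix passed on to FLICA) and `D` the role of the sparse spatial basis *D*^{(k)}.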

After the above modality-wise DicL, the final inputs to FLICA are the matrices *A*^{(k)}, *k* = 1,2,…,*K*, of size *N* × *L*^{★★} if we use *Y*^{(k)}, or *L*^{★} × *L*^{★★} if we use *V*^{(k)}. Note that (unlike the typical approach of feeding spatial PCA eigenvectors into ICA) we are not feeding the spatial dictionary basis (*D*^{(k)}) into the FLICA core modelling, but the feature loadings (*A*^{(k)}). To get the spatial loading matrices from FLICA, we perform voxel-wise multiple regression, where the target variable is a voxel and the design matrix is the FLICA subject mode. We could reverse the order by applying DicL first and then mMIGP, but empirically this has lower computational efficiency.

### Evaluation of BigFLICA in simulations

We simulated 500 subjects, each with two modalities, both of which were 30 × 30 × 30 images. We first simulated *K* ground-truth (independent) spatial maps *X*; each of these was a 30 × 30 × 30 image. Each spatial map was a weighted sum of two Gaussian white noise images: the first was 30 × 30 × 30 with weight 0.05, and the second was a 5 × 5 × 5 cube randomly located in the full image with weight 0.95. Then, random positive component weights *W*, Gaussian random subject loadings *H* and Gaussian white noise terms *E* were simulated. Finally, after vectorizing each spatial map and noise term, the data for a single modality *Y* were generated as *Y* = *HWX* + *σE*, where *σ* is a parameter controlling the signal-to-noise ratio (SNR). A small amount of spatial smoothing with a Gaussian kernel was applied to the spatial maps *X* and noise terms *E* to mimic real image data. Each of the two modalities also had 5 unique spatial maps not shared with the other. The voxels were z-score normalized before being fed into the subsequent FLICA analysis. The SNR was defined as *var*(*vec*(*HWX*))/*var*(*vec*(*σE*)).
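A minimal version of this generative process for one modality could look as follows; the smoothing width and the range of the positive weights are our illustrative choices, not values stated in the paper:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)
N, K, shape = 500, 25, (30, 30, 30)   # subjects, components, image size
P = int(np.prod(shape))
target_snr = 4.0

# spatial maps: weak global noise plus a strong, randomly placed 5x5x5 cube
X = np.zeros((K, P))
for k in range(K):
    img = 0.05 * rng.standard_normal(shape)
    x0, y0, z0 = rng.integers(0, 26, size=3)
    img[x0:x0+5, y0:y0+5, z0:z0+5] += 0.95 * rng.standard_normal((5, 5, 5))
    X[k] = gaussian_filter(img, sigma=1.0).ravel()   # mild spatial smoothing

H = rng.standard_normal((N, K))                      # subject loadings
W = np.diag(rng.uniform(0.5, 1.5, size=K))           # positive component weights
E = np.stack([gaussian_filter(rng.standard_normal(shape), 1.0).ravel()
              for _ in range(N)])                    # smoothed noise images

signal = H @ W @ X
sigma = np.sqrt(signal.var() / (target_snr * E.var()))  # set the SNR exactly
Y = signal + sigma * E                                  # Y = HWX + sigma*E
```

Solving for `sigma` from the SNR definition *var*(*vec*(*HWX*))/*var*(*vec*(*σE*)) fixes the noise level at exactly the requested ratio.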

Performance evaluation: When FLICA was applied to the simulated data, the number of independent components was always set to the ground truth *K*. The performance was measured by the similarity between estimated subject-mode matrix *H*^{★} and the ground truth *H*. The similarity was measured by the greedy matching of the components based on maximum correlation and then estimating the mean correlations across components.
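The matching score can be computed as in the following sketch, our own minimal implementation of the greedy maximum-correlation pairing:

```python
import numpy as np

def matched_mean_correlation(H_est, H_true):
    """Greedily pair columns of H_est and H_true by maximum absolute
    correlation (each component used at most once) and return the mean
    of the matched correlations."""
    K = H_true.shape[1]
    # K x K absolute correlation matrix between estimated and true columns
    C = np.abs(np.corrcoef(H_est, H_true, rowvar=False)[:K, K:])
    total = 0.0
    for _ in range(K):
        i, j = np.unravel_index(np.argmax(C), C.shape)
        total += C[i, j]
        C[i, :] = -1.0   # exclude this estimated component from reuse
        C[:, j] = -1.0   # exclude this ground-truth component from reuse
    return total / K
```

A perfect recovery up to permutation and sign flips (the inherent ambiguities of ICA) scores 1.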

Evaluation of mMIGP for subject-space dimension reduction: After generating simulated data, we reduced the data to varying dimensions (*L*^{★} = 50, 100, 200, 300, 400) using mMIGP, and then fed the reduced data into FLICA. This was compared with the original FLICA. The number of ground-truth components was set to 25,35,45 and the SNR was set to 4, 1, 0.25, 0.04. All simulations were repeated 50 times.

Evaluation of DicL for voxel-space dimension reduction: To evaluate the influence of the DicL parameters on the subsequent FLICA results, we performed DicL on simulated data using varying parameter combinations (*λ* = 0.1 to 16 and *L*^{★★} = 100 to 3000) followed by FLICA (nIC = 25, 50, 100). This was compared with the original FLICA. The SNR was set to 4, 1, 0.25, 0.04, and the number of iterations for the DicL was set to 50, because we empirically found that this number of iterations is sufficient for DicL to converge to a stable result in simulated and real data. All simulations were repeated 50 times.

### HCP and UK Biobank data

The voxel/vertex-wise neuroimaging data of 81 different modalities from 1,003 subjects of the HCP S1200 data release were used in this paper^{2}. The preprocessing was conducted by the HCP team using an optimized pipeline^{41}. The 81 modalities included: (1) 25 resting-state ICA dual-regression spatial maps (z-score normalized); (2) 47 unique task contrast maps as z-statistics from 7 different fMRI tasks; (3) 3 T1-image derived modalities (grey matter volume, surface area, surface thickness); (4) 6 Tract-Based Spatial Statistics (TBSS) features from diffusion MRI (FA, L1, L2, L3, MD, MO)^{42}. In addition, 158 nIDPs were used, the same set as in our previous study^{43}. Names of the nIDPs are given in **Supplementary File 1**.

The UK Biobank imaging data were mainly preprocessed by FSL^{34,35} and FreeSurfer^{44} following an optimized pipeline^{45} (https://www.fmrib.ox.ac.uk/ukbiobank/). The voxel-wise neuroimaging data of 47 modalities from 14,503 subjects were used in this paper, including: (1) 25 “modalities” from the resting-state fMRI ICA dual-regression spatial maps (z-score normalized); (2) 6 modalities from the emotion task fMRI: 3 contrasts (shapes, faces, faces>shapes) of z-statistics and 3 *contrast of parameter estimate* maps; (3) 10 diffusion MRI derived modalities (9 TBSS features, including FA, MD, MO, L1, L2, L3, OD, ICVF, ISOVF^{42,46}, and a summed tractography map of 27 tracts from AutoPtx in FSL); (4) 4 T1-MRI derived modalities (grey matter volume and the Jacobian map (which shows expansion/contraction generated by the nonlinear warp to standard space, and hence reflects local volume) in volumetric space, and cortical area and thickness in FreeSurfer’s fsaverage surface space); (5) 1 susceptibility-weighted MRI map (T2-star); (6) 1 T2-FLAIR MRI derived modality (a white matter hyperintensity map estimated by BIANCA^{47}). A detailed description is given in **Table S5**. In addition, of the 8,787 available nIDPs, we retained the 7,245 that have at least 1,000 non-missing values (subjects). Names of the nIDPs are given in the **Supplementary Files**. Group-level resting-state independent component spatial maps and task activation z-statistic maps are also provided in the **Supplementary Files**.

When carrying out nIDP prediction, a total of 13 and 54 confounding variables were regressed out of the nIDPs using linear regression in the HCP and UKB datasets respectively (**Supplementary Materials**). A subject’s missing modality was imputed with the mean value across all other subjects. We did not impute missing nIDPs.

### Comparing BigFLICA with the original FLICA on real data

On real data, we do not know the ground-truth components, and the data may not follow the assumptions of ICA. Therefore, we rely on the performance of predicting nIDPs as a surrogate criterion for evaluating different methods. We applied the proposed mMIGP approach to the HCP data and to a subset of 1,036 UKB subjects (so that the original FLICA is computationally tractable). Elastic-net regression, from the glmnet package^{48}, was used to predict the nIDPs using FLICA’s subject modes as model regressors (features). This approach is widely used and has been shown to achieve robust, state-of-the-art performance in many neuroimaging studies^{24,25}. To evaluate model performance, for each nIDP we used 5-fold cross-validation, quantifying prediction accuracy as the Pearson correlation between the predicted and true values of that nIDP across the 5 test sets. As the elastic-net regression has tuning parameters, within each training set we performed a nested 5-fold cross-validation to tune them, and used the best model selected by this nested cross-validation to make predictions in the test set. When comparing any two approaches, the same training-validation-testing split was used.
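A sketch of this evaluation scheme, using scikit-learn’s `ElasticNetCV` in place of glmnet (the data are synthetic, and the `l1_ratio` and fold counts are illustrative rather than the paper’s settings):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 50))                      # subject modes (toy)
y = X[:, 0] - 0.5 * X[:, 1] + rng.standard_normal(300)  # one nIDP (toy)

preds = np.zeros_like(y)
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # the inner 5-fold CV (cv=5) tunes the penalty strength on training data only
    model = ElasticNetCV(l1_ratio=0.5, cv=5).fit(X[train], y[train])
    preds[test] = model.predict(X[test])

accuracy, _ = pearsonr(y, preds)  # Pearson r between true and predicted values
```

Because the penalty is tuned strictly inside each training fold, the outer-fold correlation is an unbiased estimate of out-of-sample prediction accuracy.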

To evaluate MIGP preprocessing, we reduced the dimension to varying *L*^{★} (from 100 to 500) using MIGP first and then used FLICA to extract *L* = 50 components. The original FLICA was also applied to extract 50 components. To evaluate DicL preprocessing, we used the DicL (dictionary dimension = 2000 and sparsity parameter *λ* = 1) to reduce the data dimension of each modality followed by the FLICA to extract varying numbers of components (nIC= 25, 50, 100, 200, 300). The original FLICA was also applied to extract the same numbers of components. The prediction accuracy of BigFLICA was compared with the original FLICA applied on non-reduced data.

### Statistical significance of difference of prediction accuracy between two approaches

To compare the overall prediction accuracy of two approaches (e.g., BigFLICA with mMIGP preprocessing vs. the original FLICA), we estimate the statistical significance of the difference between the prediction correlations across nIDPs. Suppose that we have a total of *p* nIDPs; we first filter out the subset of nIDPs for which both methods have low prediction accuracy (*r* < 0.1 in our analysis), leaving *p*_{1} nIDPs. If we performed a simple paired t-test, the correlation structure among nIDPs would make the samples dependent on each other, so the p-value would not be valid. Based on the fact that a paired t-test is a special case of the general linear model (where the *y* variable is the difference in prediction accuracy, the *x* variable is a column of ones, and the statistical significance is the significance of the coefficient of *x*), we used a weighted least squares approach (the lscov function in Matlab) to obtain a reliable estimate of statistical significance, by taking into account the covariance structure between nIDPs (estimated as the covariance of the nIDPs-by-subject matrix).
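An equivalent generalized-least-squares computation (what `lscov(ones(p1,1), d, S)` does in Matlab) can be written directly; the inputs below are toys, with an identity covariance for illustration, under which the calculation reduces to an ordinary paired t-test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p1 = 50
d = rng.standard_normal(p1) + 0.2   # paired accuracy differences (toy)
S = np.eye(p1)                      # nIDP covariance structure (toy: identity)

# GLS fit of d = 1*beta + e with Cov(e) proportional to S
ones = np.ones(p1)
Sinv_1 = np.linalg.solve(S, ones)
beta = ones @ np.linalg.solve(S, d) / (ones @ Sinv_1)   # weighted mean difference
resid = d - beta
sigma2 = resid @ np.linalg.solve(S, resid) / (p1 - 1)   # scale of S, estimated
se = np.sqrt(sigma2 / (ones @ Sinv_1))                  # standard error of beta
t_stat = beta / se
p_value = 2 * stats.t.sf(abs(t_stat), df=p1 - 1)
```

With a non-trivial `S`, correlated nIDPs are down-weighted, correcting the dependence problem described above.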

### Parameter settings of running BigFLICA in the full HCP and UKB datasets

We applied the BigFLICA approach to extract varying numbers of target components in the two datasets. In HCP, we used FLICA with DicL preprocessing only (dictionary dimension 2000 and *λ* = 1). In UKB, we used FLICA with both mMIGP and DicL preprocessing (dictionary dimension 5000, *λ* = 1 and mMIGP dimension 1000 (*>*95% explained variance)). The number of FLICA VB iterations was 1000.

### Comparing BigFLICA with multiple independent single-modality ICA decomposition

ICA is a widely-used approach for decomposing single-modality neuroimaging data, including functional MRI^{43} but also structural MRI^{49} and diffusion MRI^{50}. A natural question is whether BigFLICA can combine multimodal information more effectively than single-modality approaches such as ICA (we used the fastICA algorithm^{51}), which ignore inter-modality relationships.

We first performed ICA on each modality of the HCP and UKB data separately to extract 25, 100, 250, 500 and 750 components. For a given component number, we built a prediction model using the concatenated ICA subject modes (across modalities) to predict each of the nIDPs. For a fair comparison, for BigFLICA we extracted the same number of ICs to build the prediction model. For example, in the UKB data with a 25-dimensional decomposition, the predictor is a Subject × (25 × 47) matrix for single-modality ICA, where 25 is the number of components in each modality and 47 is the total number of modalities, while for BigFLICA the predictor is a Subject × 25 matrix. This is arguably a fair comparison because each of the BigFLICA modes potentially contains information from all modalities. The method for building and evaluating the predictive model is the same as above, except that when using the concatenated ICA subject modes we added a univariate screening step in the training set to select the top 300 most informative features according to their correlation with an nIDP in the training set. This step generally boosts predictive accuracy: because the dimensionality of the concatenated ICA modes is usually very high, many of the modes are pure noise with respect to any given nIDP, and univariate screening helps the elastic-net regression to filter out noisy features effectively. We did not perform univariate screening when using the BigFLICA subject modes to predict nIDPs.
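The screening step amounts to ranking features by their absolute training-set correlation with the target; a minimal sketch (the function name is ours):

```python
import numpy as np

def screen_top_k(X_train, y_train, k=300):
    """Return the indices of the k features most correlated (in absolute
    value) with the target, computed on the training set only to avoid
    leaking test information."""
    Xc = X_train - X_train.mean(axis=0)
    yc = y_train - y_train.mean()
    r = (Xc * yc[:, None]).sum(axis=0) / (
        np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    return np.argsort(-np.abs(r))[:k]
```

The selected columns of the training and test matrices are then passed to the elastic-net regression as usual.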

Besides the main results, in **Fig. S1**, we also compared, in the UKB data, the 750-dimensional BigFLICA decomposition with the 25-dimensional ICA decomposition concatenated across modalities, i.e., with 25 × 47 features for the single-modality ICA. In this comparison, the number of features for the two methods is almost the same, yet BigFLICA clearly outperforms the single-modality ICA.

### Comparing BigFLICA with hand-curated imaging-derived phenotypes

A popular choice of data analysis strategy is to extract imaging features based on expert knowledge (e.g., regional volumes and thickness, and resting-state functional connectivities between brain regions), often referred to as IDPs^{1}. Brain IDPs have been shown to genetically correlate with many SNPs in our previous genome-wide association study (GWAS) in UK Biobank^{5}, and they have been shown to change in many psychiatric diseases^{26–28}.

We extracted 5,812 IDPs from the HCP, including (1) 199 structural MRI features from Freesurfer as provided by the HCP; (2) 4700 regional mean task activations from 47 independent task contrasts using a 100-dimensional parcellation atlas^{52}; (3) 625 functional connectivities (FCs) based on a 25-dimensional ICA parcellation with partial correlation to estimate FCs; (4) 288 regional mean TBSS features (FA, L1, L2, L3, MD, MO) using the Johns Hopkins University tract atlas. The names of these IDPs are given in the **Supplementary File 3**.

We used 3,913 IDPs from UKB, including global and local features from the 6 imaging modalities (T1, T2-FLAIR, swMRI, tfMRI, rfMRI, and dMRI)^{53}. The names of these IDPs are given in the **Supplementary File 4**.

We built prediction models that use IDPs or BigFLICA modes to predict each of the nIDPs using the same strategy as above. The FLICA dimension was set to 25, 100, 250, 500, 750. In addition, we also concatenated the IDPs with each of the BigFLICA subject modes to predict the nIDPs, and compared the performance with using IDPs alone. We used a univariate screening step to select the top 300/500 most informative IDPs according to their correlation with an nIDP in the inner fold (i.e., training set) of HCP/UKB. Finally, we also built models that use IDPs to predict each of the FLICA subject modes and vice versa, aiming to evaluate the shared variance between features extracted by these two different approaches from the same data.

### Reproducibility of BigFLICA

To test whether BigFLICA’s spatial independent components are estimated reliably, the whole UKB dataset was divided into two parts: the first contained 7,000 subjects and the second contained the remaining 7,503 subjects. We applied BigFLICA to the two parts separately. After estimating the subject modes, we reconstructed the z-score normalized (voxel-wise) spatial maps of each modality by regressing the subject mode against the mMIGP-reduced data. The spatial independent components of each modality were concatenated spatially and greedily paired based on the absolute correlation between the two runs. When computing the correlations, only voxels whose absolute z-scores were larger than 3 in both runs were retained (to reduce noise, given that in general there are huge numbers of empty voxels across all modalities for a given FLICA component; this does not bias the reproducibility metric towards finding common similar patterns). **Fig. S5** (left) shows that the FLICA components have very high reproducibility in the split-half test across varying numbers of components.

### Comparing BigFLICA with mCCA

We tested whether BigFLICA (independent component-based spatial modelling) outperforms mCCA (eigendecomposition-based modelling, which can be considered similar to the output of BigFLICA without running the final core FLICA unmixing; note that running mCCA at this scale requires the same mMIGP initial processing that we have added in this work) in three ways. The number of extracted components was the same in each comparison. First, for the prediction accuracy of nIDPs, **Fig. S3** shows that, in the UKB data, BigFLICA has a (very slightly) improved prediction accuracy compared with mCCA. Second, we hypothesized that the modes of BigFLICA are more parsimonious features of nIDPs than those of mCCA; in other words, a smaller number of BigFLICA modes can predict the nIDPs. The results shown in **Fig. S4** support this hypothesis: for a given number of components and a given nIDP, BigFLICA modes have, on average, a higher proportion of zero weights in the elastic-net predictions than mCCA modes. The advantage is that a more parsimonious representation usually has better biological interpretability. Finally, we estimated and compared the split-half reproducibility of BigFLICA and mCCA. As shown in **Fig. S5** (right), BigFLICA has much higher between-subject reproducibility than mCCA.

### Contribution of different modalities in a BigFLICA decomposition

Besides using BigFLICA to explore the relationships between imaging and non-imaging phenotypic and behavioural data, we can also use it to investigate the relationships between different modalities. For each mode, BigFLICA estimates a vector of positive numbers reflecting the contributions of the different modalities (i.e., the diagonal of each *W*^{(k)}; the higher the number, the more important that modality is to the mode). We concatenated these vectors across all modes to form a mode-by-modality matrix *W*, and normalized each column to sum to one. Six examples of such matrices, with different numbers of estimated modes in the UKB dataset, are shown in **Fig. S7**.

We then calculated each row’s sum (across columns) of *W*, reflecting the overall contribution of each modality to the BigFLICA decomposition. As shown in **Fig. S8**, across all FLICA dimensionalities (numbers of estimated modes), each of the 25 resting-state fMRI dual-regression spatial maps usually has a low overall contribution, followed by the task fMRI maps, while modalities reflecting more of the structure of the brain (e.g., structural MRI and diffusion MRI) generally have high overall contributions. The relative difference in modality contribution between functional MRI-related and structural/diffusion MRI-related modalities becomes larger with an increasing number of estimated modes. We further estimated the total shared variance between lower- and higher-dimensional BigFLICA decompositions. **Table S4** shows that a higher-dimensional decomposition explains almost all of the variance of a lower-dimensional decomposition (upper triangle of the table), while a lower-dimensional decomposition can explain a large proportion of the variance of a higher-dimensional decomposition.

### Relationship between different modalities in a BigFLICA decomposition

We calculated the cosine similarity between different columns of *W* (using the 750-dimensional BigFLICA decomposition) to measure the similarity of different modalities in terms of their contributions to the BigFLICA decomposition, i.e., the more similar the information two modalities carry, the more likely they are to have similar contributions to a mode. **Fig. S9a** shows that the modality relationship matrix is clearly grouped into three large clusters. The first comprises all resting-state modalities, the second the task fMRI maps, and the third the diffusion MRI, structural MRI-related modalities and swMRI. The white matter hyperintensity map (T2 lesions) forms a single cluster. As a comparison, we also performed a 50-dimensional ICA decomposition within each modality and calculated the shared variance between every pair of 50 ICs in two modalities using a simple multivariate regression model. As shown in **Fig. S9b**, we observe a pattern similar to **Fig. S9a**. The main difference is that in Fig. S9a there are relatively stronger correlations within the resting-state modalities and between resting-state and other modalities, but weaker correlations between task modalities and structure-related modalities. These results reflect the fact that BigFLICA’s multimodal modelling learns different inter-modality relationships from those found by single-modality ICA.
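The modality-similarity matrix is simply the cosine similarity between columns of *W*; for reference (toy matrix, function name ours):

```python
import numpy as np

def modality_similarity(W):
    """Cosine similarity between columns of the mode-by-modality
    contribution matrix W (modes x modalities)."""
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)
    return Wn.T @ Wn   # modalities x modalities similarity matrix

# toy W: columns 0 and 1 are proportional, column 2 is different
W = np.array([[1.0, 2.0, 0.0],
              [2.0, 4.0, 1.0]])
S = modality_similarity(W)
```

Clustering this matrix (as in Fig. S9a) then groups modalities that contribute similarly across modes.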

## Supplementary Materials

### Confounding variables regressed out in our analysis

UKB dataset: age, age squared, age X sex, age squared X sex, age (quantile normalised), age squared (quantile normalised), age X sex (quantile normalised), age squared X sex (quantile normalised), rfMRI head motion, tfMRI head motion, head size scaling, rfMRI head motion squared, tfMRI head motion squared, [4] confounds relating to bed position in scanner (x), [4] confounds relating to bed position in scanner (y), [4] confounds relating to bed position in scanner (z), [4] confounds relating to bed position in scanner (table), [4] confounds relating to bed position in scanner (x) squared, [4] confounds relating to bed position in scanner (y) squared, [4] confounds relating to bed position in scanner (z) squared, [4] confounds relating to bed position in scanner (table) squared, [10] confounds modelling slow date-related drift 1, [10] confounds modelling slow date-related drift 2, [10] confounds modelling slow date-related drift 3, [10] confounds modelling slow date-related drift 4, [10] confounds modelling slow date-related drift 5, [10] confounds modelling slow date-related drift 6, [10] confounds modelling slow date-related drift 7, [10] confounds modelling slow date-related drift 8, [10] confounds modelling slow date-related drift 9, [10] confounds modelling slow date-related drift 10, rfMRI head motion (quantile normalised), tfMRI head motion (quantile normalised), head size scaling (quantile normalised), [4] confounds relating to bed position in scanner (x) (quantile normalised), [4] confounds relating to bed position in scanner (y) (quantile normalised), [4] confounds relating to bed position in scanner (z) (quantile normalised), [4] confounds relating to bed position in scanner (table) (quantile normalised), [4] confounds relating to bed position in scanner (x) squared (quantile normalised), [4] confounds relating to bed position in scanner (y) squared (quantile normalised), [4] confounds relating to bed position in scanner (z) squared (quantile normalised), [4] confounds relating to bed position in scanner (table) squared (quantile normalised), [10] confounds modelling slow date-related drift 1 (quantile normalised), [10] confounds modelling slow date-related drift 2 (quantile normalised), [10] confounds modelling slow date-related drift 3 (quantile normalised), [10] confounds modelling slow date-related drift 4 (quantile normalised), [10] confounds modelling slow date-related drift 5 (quantile normalised), [10] confounds modelling slow date-related drift 6 (quantile normalised), [10] confounds modelling slow date-related drift 7 (quantile normalised), [10] confounds modelling slow date-related drift 8 (quantile normalised), [10] confounds modelling slow date-related drift 9 (quantile normalised), [10] confounds modelling slow date-related drift 10 (quantile normalised), imaging centre, sex.

HCP dataset: image reconstruction version, age, age squared, sex, age X sex, age squared X sex, race, ethnicity, rfMRI motion, Height, Weight, FS_IntraCranial_Vol, FS_BrainSeg_Vol.

## Acknowledgments

We are grateful to UK Biobank and its participants (access application 8107). We thank the WU-Minn HCP Consortium for their invaluable contributions in generating the publicly available HCP data and implementing the procedures needed to acquire, analyze, visualize and share these datasets. Computation used the Oxford Biomedical Research Computing (BMRC) facility, a joint development between the Wellcome Centre for Human Genetics and the Big Data Institute supported by Health Data Research UK and the NIHR Oxford Biomedical Research Centre. Financial support was provided by the Wellcome Trust Core Award Grant Number 203141/Z/16/Z.