Multimodal subspace independent vector analysis captures latent subspace structures in large multimodal neuroimaging studies

We present Multimodal Subspace Independent Vector Analysis (MSIVA), a methodology to capture both joint and unique vector sources across multiple data modalities by defining linked and modality-specific subspaces. In particular, MSIVA enables the estimation of independent subspaces of various sizes within modalities and their one-to-one linkage to corresponding subspaces across modalities. We compare MSIVA to a fully unimodal initialization baseline and a fully multimodal initialization baseline, and evaluate all three approaches with five distinct subspace structures on synthetic and neuroimaging datasets. We first demonstrate that MSIVA and the unimodal baseline can identify the correct ground-truth subspace structures from the incorrect ones in multiple synthetic datasets, while the multimodal baseline fails at detecting high-dimensional subspace structures. We then show that MSIVA can better capture the latent subspace structure with the minimum loss value in two large multimodal neuroimaging datasets compared to the unimodal baseline. Our results from subsequent per-subspace canonical correlation analysis (CCA) and brain-phenotype modeling demonstrate that the sources from the optimal subspace structure are strongly associated with phenotype measures, including age, sex and schizophrenia-related effects. Our proposed methodology MSIVA can be applied to capture linked and unique biomarkers from multimodal neuroimaging data.


Introduction
Multiple neuroimaging techniques such as magnetic resonance imaging (MRI) have been developed to uncover the structural and functional properties of the brain.However, each neuroimaging modality has its own strengths and weaknesses, and only captures limited information of the brain.For example, structural MRI (sMRI) can show high-resolution anatomical structure of the brain but cannot capture temporal dynamics, while functional MRI (fMRI) can measure blood-oxygenationlevel-dependent (BOLD) signals across time at the cost of lower spatial resolution.Therefore, it is important to take advantage of multimodal neuroimaging to effectively capture interpretable and multifaceted information about the brain.
To jointly analyze multiple datasets or data modalities, a unified framework Multidataset Independent Subspace Analysis (MISA) [1] has been developed, encompassing multiple blind source separation methods, such as independent component analysis (ICA) [2], independent vector analysis (IVA) [3], and independent subspace analysis (ISA) [4].MISA can be applied to identify latent sources from multiple neuroimaging modalities including sMRI and fMRI [1].More recently, a multimodal IVA (MMIVA) fusion method built upon MISA has been proposed to identify linked biomarkers related to age, sex and psychosis in two large multimodal neuroimaging datasets [5].MMIVA assumes that sources are independent within each modality, i.e. the subspace structure is an identity matrix.However, there may also exist linkage among sources in the same neuroimaging modality, potentially grouped by their anatomical or functional properties, which are not optimally captured by MMIVA.Aiming to better detect linkage across sources and modalities, we present a novel methodology, Multimodal Subspace Independent Vector Analysis (MSIVA) [6], that captures linkage of vector sources by defining joint and unique subspaces.MSIVA is built upon MMIVA by defining a block diagonal matrix as the subspace instead of the identity matrix used in MMIVA.In addition, MSIVA is initialized with the weight matrices obtained by combining multimodal group principal component analysis (MGPCA) across modalities with separate ICAs for each modality.By design, MSIVA can simultaneously capture two types of latent sources: those linked across all modalities, as well as those unique to a specific modality.
We compare MSIVA with a fully unimodal initialization approach and a fully multimodal initialization approach.We demonstrate that MSIVA can successfully reconstruct both joint and unique sources, revealing the correct subspace structures in multiple synthetic datasets.We then apply MSIVA and the baseline approach on two large multimodal neuroimaging datasets, the UK Biobank dataset [7] and a mixed schizophrenia (SZ) patient dataset from several studies [8,9,10].Our results indicate that MSIVA can better capture linked and modality-specific sources in the neuroimaging datasets with lower loss values compared to the baseline approach.Using canonical correlation analysis (CCA) [11], we conduct a follow-up assessment of each identified subspace separately and find projections within the optimal subspace structure yielding the post-CCA linked sources.We then perform an age regression task, a sex classification task, and an SZ classification task to evaluate the associations between these linked sources and phenotype measures.Results from brain-phenotype modeling suggest that the post-CCA sources are associated with age, sex and SZ-related effects.

Subspace Structure
Our interest lies on identifying groups of linked (i.e, not independent) sources within each modality, while assuming sources in different groups are statistically independent.These groups are referred to as subspaces.In addition, we aim to detect cross-modal linkage (i.e., statistical dependence) between subspaces.This requires solving a challenging combinatorial optimization problem.To simplify, we limit the search space of cross-modal linkages by assuming that links occur only between subspaces with the same size across modalities.Additionally, we assume all modality-specific subspaces to be 1D, i.e., a single source.
Building on the MISA framework, we require a user-defined candidate subspace structure describing the expected linkage pattern.The goal of MSIVA is to determine which of the candidate subspace structures best fits the data.In particular, as shown in Figure 1, we propose five plausible subspace structures (S 1 − S 5 ) in two modalities (M 1 − M 2 ), all with 12 sources in each modality: • S 1 : One 2-dimensional (2D) linked subspace, one 3D linked subspace, one 4D linked subspace, and three 1D modality-specific subspaces.
These choices are based on the functional imaging literature where two to four dimensions are commonly used to cluster functional networks [12,13].

Multimodal Subspace Independent Vector Analysis
Given our constraint that linked cross-modal subspaces must have the same size, we refer to our proposed approach as Multimodal Subspace Independent Vector Analysis (MSIVA).MSIVA is designed to capture both linked and modality-specific subspaces.
Three initialization workflows are considered for MSIVA, as shown in Figure 2. In all cases, given a candidate subspace structure, MSIVA consists of iterative combinatorial optimization (CO) of the source estimates (cross-modal subspace alignment) and numerical optimization of the MISA loss (see Equation 3).This process is repeated for each of five candidate subspace structures (S 1 − S 5 ) and followed by a best-fit determination based on the final loss values from all candidates.The following subsections explain each of these steps in detail.

MSIVA Initialization
The MSIVA approach first utilizes multimodal group principal component analysis (MGPCA) to identify common principal components across two modalities and then applies separate ICA on the MGPCA-reduced data from each modality.The multimodal data matrices X [m] ∈ R Vm×N are reduced to C principal directions by MGPCA.Unlike principal component analysis (PCA) that identifies orthogonal directions of maximal variation for each modality separately, MGPCA identifies directions of maximal common variation across all modalities.Eigenvectors are computed based on the average of the scaled covariance matrices: where Σ N is the ratio of the variance in the modality to the number of samples.We define the whitening matrices whtM [m] as follows: where Λ and H are the top C eigenvalues (with the largest absolute values) and eigenvectors of Σ avg , respectively, M trace(Σ [m] ) .Next, the MGPCA-reduced data from each modality X [m] r undergoes a separate ICA estimation using the Infomax objective [14] r (n) per modality.These estimates are then optimized by running MISA as a unimodal ICA model initialized with F whtM [m] from both modalities.Following, MSIVA is compared with a fully unimodal initialization approach and a fully multimodal initialization approach.

Unimodal Initialization
The unimodal initialization approach simply applies PCA and ICA on each modality separately.We first project the data matrix from each modality X [m] into a reduced data matrix X [m] r with C principal components.Next, we run ICA on each reduced data matrix X [m] to obtain C independent sources.

Multimodal Initialization
The multimodal initialization approach sequentially applies MGPCA and group ICA (GICA) across all modalities.GICA performs ICA on the MGPCA reduced data from both modalities combined, i.e., X r = M m=1 X [m] r .

Alternating Combinatorial and Numerical Optimizations
All three approaches utilize MISA's greedy combinatorial optimization (CO) and objective to recover sources.Greedy CO and MISA optimization are run iteratively for 10 and 20 iterations on synthetic and neuroimaging data, respectively, until the loss value converges.MISA uses the relative gradient and the L-BFGS [15] in a barrier-type optimization (fmincon from MATLAB's Optimization Toolbox).Finally, we identify the optimal subspace structure based on the lowest loss value obtained.The loss function L [1] is defined as the Kullback-Leibler (KL) divergence between the joint Kotz distribution [16] of all sources p(s) and the product of the joint Kotz distribution of sources at each of K subspaces q(s) = K k=1 p(s k ).Thus, subspaces are expected to be statistically independent of one another and may include multimodal sources.We want to minimize the loss L by solving the following optimization problem: where W is the unmixing matrix such that s(n) = Wx(n) and P k is the k-th subspace assignment matrix defined by the subspace structure S.

Datasets 2.3.1. Synthetic Data
For each subspace structure, we generate a synthetic dataset X [m] ∈ R V ×N , where m is the modality index (m ∈ {1, 2}), V is the input feature dimensionality (V = 20000) and N is the number of samples (N = 3000).V and N are chosen to approximate the number of features and samples in the UK Biobank (UKB) neuroimaging dataset [7].X [m] is a linear mixture of 12 sources spanning the defined subspaces.Each subspace is independently sampled from a multivariate Laplace distribution (the marginal distributions correspond to different sources).Sources in the same subspace, but assigned to different modalities, are dependent with a correlation coefficient ranging from 0.65 to 0.85.Sources in the 1D subspaces from S 1 to S 4 are independent from all others, i.e., correlation coefficient is 0.

Experiments
For each of five subspace structures (S 1 − S 5 ), we first generate a synthetic dataset from the ground-truth subspace structure.Then we perform the experiments on all combinations of three initialization approaches and five subspace structures.Finally we measure the normalized multidataset Moreau-Amari intersymbol interference (ISI) [1,17,18], a metric to evaluate the residual differences between the recovered sources and the ground-truth sources, as well as the loss values.The synthetic data experiments aim to verify whether the proposed approaches including MSIVA can identify and distinguish the correct subspace structure used for data generation from the incorrect ones.
We then perform the same experiments on two multimodal neuroimaging datasets using these five candidate subspace structures, and identify the optimal structure yielding the lowest final MISA loss value.Separate follow-up CCA of each subspace recovered projections with maximum cross-modal correlation.

Phenotype Prediction
To further evaluate that association between the post-CCA sources and phenotype measures, we perform an age prediction task and a sex classification task for the UKB dataset, as well as an age prediction task and a binary SZ classification task (controls vs SZ patients) for the patient dataset.Specifically, we train a ridge regression model to predict the age, and a linear support vector machine to classify sex groups or SZ patients.For the UKB dataset, 2907 subjects are divided into a training set of 2000 subjects and a hold-out test set of 907 subjects.For the patient dataset, 999 subjects are divided into a training set of 699 subjects and a hold-out test set of 300 subjects in the age prediction task; 875 controls and SZ patients are grouped into a training set of 612 subjects and a test set of 263 subjects in the SZ diagnosis classification task.We perform 10-fold cross-validation to choose the best hyperparameters (regularization parameter range: [0.1, 1]) on the training set, then train the model using all training subjects and evaluate it on the hold-out test set.The age regression performance is measured by mean absolute error (MAE) and the classification performance is measured by accuracy.
Table 1: Synthetic data results: Final MISA loss values (the lower the better).Each row represents a ground-truth (GT) subspace structure used to generate the data and each column represents a test subspace structure used to fit the model.The lowest loss value along the row is highlighted in bold, which determines the selected subspace.Approaches performing consistently well will contain bold loss values only along the diagonal.The unimodal initialization approach correctly identifies all five subspace structures.MSIVA correctly identifies all cases except for S 4 .The multimodal initialization approach fails at detecting S 1 and S 4 .

Unimodal Initialization (PCA+ICA)
S   Table 2: Neuroimaging data results: Final MISA loss values (the lower the better).MSIVA with the subspace structure S 2 outputs the lowest loss values in both multimodal neuroimaging datasets, and thus is considered as the optimal approach to capture the latent subspace structure in neuroimaging data.

Synthetic Data Results
As shown in Figure 3, the unimodal initialization (PCA+ICA) and the MSIVA initialization (MGPCA+ICA) lead to the lowest ISI values (≤ 0.02) along the main diagonal, demonstrating that both approaches can correctly distinguish the ground-truth subspace structures from the incorrect ones.According to Table 1, the loss values are almost consistent with the ISI results, except that MSIVA misidentifies S 4 .We also note that the loss value from MSIVA is slightly smaller than that from the unimodal initialization for each correct subspace structure, suggesting that MSIVA can fit the data slightly better than the unimodal initialization.However, the multimodal initialization with MGPCA and GICA fails at correctly detecting the ground-truth subspace structure with highdimensional 4D subspace(s), i.e. S 1 and S 4 .The multimodal initialization is considered worse due to the high ISI value (0.07) in the main diagonal and, thus, is excluded from neuroimaging data analysis according to the simulation results.
The recovered subspace structures from MSIVA and the unimodal initialization align with the proposed ground-truth ones (Figure 4).Block permutation of the subspaces is allowed, as long as alignment between modalities is retained (showing sorted result for ease of interpretation).Again, the multimodal initialization (MGPCA+GICA) fails at recovering the ground-truth subspace structures S 1 and S 4 , suggesting that cross-modal alignment in high-dimensional subspaces may increase the difficulty of the optimization problem.

Neuroimaging Data Results
In the UKB neuroimaging dataset, we observe that within-modal self-correlation patterns align with the predefined subspace structures (Figure 5, rows I-II and IV-V).We note that MSIVA recovers stronger cross-modal correlation than the unimodal baseline for all predefined subspace structures (Figure 5, rows III and VI).MSIVA S 2 yields the lowest final MISA loss value 46.77 across all cases (Table 2), suggesting that the subspace structure S 2 best fits the latent structure of this dataset.
Similarly, in the patient dataset, MSIVA shows stronger cross-modal correlation for all subspace structures (Figure 6, rows III and VI).Same as the UKB dataset, MSIVA S 2 yields the lowest final loss value 45.67 across all cases (Table 2).
We then identify the associations between the phenotype measures and the sources captured by MSIVA S 2 .In the UKB dataset, visual inspection of individual variability from the cross-modal CCA projections in each linked subspace (Figure 7) suggests that subspaces 1, 3, 4 and 5 are associated   with aging (especially cross-modal source 9 in subspace 5), while subspaces 2 and 4 show sex effect (especially cross-modal source 7 in subspace 4).The age regression and sex classification results also confirm this finding (Table 3).Specifically, the age prediction mean absolute error (MAE) in subspace 5 is the lowest (5.400 years), and the sex classification accuracy is the highest in subspace 4 (0.812).In the patient dataset, we observe aging effect in source 3 from subspace 2 and source 9 from subspace 5, and psychosis effect in source 4 from subspace 2 and source 10 from subspace 5 according to cross-modal CCA projections in each linked subspace (Figure 8), verified by the age regression and diagnosis classification result (Table 3).No significant sex effect is found in the patient dataset.
The dual-coded spatial maps [19] from sources in both modalities are presented in Figure 9, where the voxel intensity controls both the color and opacity.In the UKB dataset, source 7 from subspace 4 shows the strongest sex effect, while the source 9 from subspace 5 shows the strongest aging effect.The sex effect can be found in the occipital lobe (sMRI, fMRI), the frontal lobe (sMRI, fMRI) and the cerebellar region (sMRI).The age effect is shown in the cerebellar region (sMRI), the occipital lobe (fMRI), as well as the sensorimotor and visuomotor areas (sMRI, fMRI).In the patient dataset, the SZ effect can be found in the cerebellar area (sMRI), the temporal lobe (sMRI), the frontal lobe (sMRI, fMRI) and the occipital lobe (sMRI, fMRI).The age effect can be seen in the occipital lobe (sMRI, fMRI) and the frontal lobe (fMRI).

Discussion
We present a novel methodology, Multimodal Subspace Independent Vector Analysis (MSIVA), to capture linked cross-modal sources as well as modality-specific sources.We first show that both MSIVA and the unimodal baseline can correctly identify the ground-truth subspace structures from the incorrect ones, and verify that the correct subspace structures result in the lowest loss values from multiple synthetic data experiments.We next apply both approaches on two large multimodal neuroimaging datasets and demonstrate that MSIVA can better reveal the latent subspace structure across two imaging modalities with lower loss values.Among all cases, MSIVA S 2 outputs the lowest loss value and, thus, is considered as the best fit to the latent structure in both neuroimaging datasets.The CCA projections within each shared subspace are strongly associated with age, sex and SZ-related effects, as verified through the phenotype prediction tasks.The age, sex and SZ-related spatial maps align with previous findings [5].
Furthermore, MSIVA can be viewed as an extension of MMIVA except that their initialization methods (MSIVA: MGPCA+ICA initialization; MMIVA: MGPCA+GICA initialization) and subspace structures (MSIVA: flexible subspace structures; MMIVA: S 5 ) are different.To further investigate the estimated sources from MSIVA and MMIVA, we compare MSIVA with the subspace structure S 2 and MMIVA by using MSIVA S 2 sources to predict MMIVA sources, as well as using matched MMIVA sources to predict MSIVA S 2 sources.We find that every two MSIVA S 2 sources from each subspace can predict more than two MMIVA sources, while every two matched MMIVA sources can also predict more than two MSIVA S 2 sources (see Appendix A).Thus, there is no one-to-one mapping between MSIVA S 2 sources and MMIVA sources.MSIVA and MMIVA estimate sources in different ways.
It is important to acknowledge limitations of our current work.The first limitation is the subspace structure used in MSIVA.MSIVA selects the subspace structure that best fits the data from a predefined set, according to the loss value.However, it is not efficient to exhaustively evaluate potential subspace structures and corresponding performance metrics.Additionally, we make two assumptions the subspace structure that the linked subspaces share the same size across modalities, and that the modality-specific subspaces are one-dimensional.However, it is possible that these assumptions may not represent the underlying data structure.Ideally, we want to learn the underlying subspace structure from the data in an unsupervised manner.The second limitation is the linear mixing assumption in MSIVA.MSIVA assumes that the data of each modality can be transformed to linearly mixed sources, but the true mixing process can be nonlinear.Also, individual differences, site effects and preprocessing-related effects may introduce nonlinear artifacts, which are not taken into account in MSIVA.Future work will include applying data-driven subspace structures such as NeuroMark template [20], and developing nonlinear unsupervised learning algorithms to automatically estimate the optimal subspace structure.

Conclusions
Our proposed approach MSIVA can capture linked and modality-specific sources on both synthetic and neuroimaging datasets and yield a lower loss value compared to the unimodal baseline.The sources in the linked subspaces are strongly associated with age, sex and psychosis according to brain-phenotype modeling.Our results support that MSIVA can be applied to capture linked and unique imaging biomarkers from large multimodal neuroimaging datasets.

Figure 1 :
Figure 1: Five plausible candidate subspace structures (S 1 − S 5 ) for two modalities (M 1 − M 2 ).The top row shows the subspace structures for the first modality (M 1 ) while the bottom row shows the subspace structures for the second modality (M 2 ).Each panel depicts the idealized association between sources within each modality, across five different plausible scenarios.The block size represents the number of linked sources within modality (i.e., the subspace size).Same-color blocks are linked across modalities.The 1 × 1 blocks indicate modality-specific sources, except S 5 , whose sources are one-to-one linked across two modalities.

S 1 S 2 S 3 S 4 Figure 3 :
Figure 3: Synthetic data results: ISI (the lower the better).The unimodal initialization (PCA+ICA) and the MSIVA initialization (MGPCA+ICA) lead to the lowest ISI values (≤ 0.02) along the main diagonal, indicating that these two approaches can identify the correct ground-truth subspace structures from the incorrect ones.However, the multimodal initialization (MGPCA+GICA) fails at detecting the subspace structure S 4 with a high ISI value (0.07) in the main diagonal.Thus, MSIVA and the unimodal baseline are considered better than the multimodal baseline.

Figure 4 :
Figure 4: Synthetic data results: Interference matrices corresponding to diagonal ISI values in Figure 3. Correct subspace structures are identified and aligned across two modalities, following the ground-truth simulation design, except that the multimodal initialization (MGPCA+GICA) fails at reconstructing S 1 and S 4 .

Figure 7 :
Figure 7: MSIVA S 2 linked UKB dataset sources, color coded by age and sex.Rows I and II show aging effect (warm color: older group; cold color: younger group).Rows III and IV show sex effect (blue: male; red: female).M 1 : sMRI GM, M 2 : fMRI ALFF.In particular, subspaces 1, 3, 4 and 5 are associated with aging (especially cross-modal source 9 in subspace 5), while subspaces 2 and 4 show sex effect (especially cross-modal source 7 in subspace 4).

Figure 8 :
Figure 8: MSIVA S 2 linked patient dataset sources, color coded by age and diagnosis labels.Rows I and II show aging effect (warm color: older group; cold color: younger group).Rows III and IV show psychosis effect (light blue: controls; orange: schizophrenia patients).M 1 : sMRI GM, M 2 : fMRI ALFF.In particular, subspaces 2 and 5 are associated with aging (especially cross-modal source 9 in subspace 5) and SZ-related effects (especially cross-modal source 4 in subspace 2).

Table 3 :
Phenotype prediction performance using MSIVA S 2 linked subspaces.For the UKB dataset, sources from subspaces 5 and 4 yield the best age regression and sex classification performance, respectively.For the patient dataset, sources from subspaces 5 and 2 yield the best age regression and SZ classification performance, respectively.These linked sources from MSIVA S 2 are strongly associated with age, sex, and SZ-related effects.