Abstract
Resting state functional magnetic resonance imaging (rs-fMRI) data exhibit complex but structured patterns. However, the underlying origins are unclear and entangled in rs-fMRI data. Here we establish a variational auto-encoder, as a generative model trainable with unsupervised learning, to disentangle the unknown sources of rs-fMRI activity. After being trained with a large dataset from the Human Connectome Project, the model has learned to represent and generate patterns of cortical activity and connectivity using latent variables. The distribution of the latent representation reveals overlapping functional networks, and its geometry is unique to each individual. Our results support the functional opposition between the default mode network and the task-positive network, although this opposition is asymmetric and non-stationary. Correlations between latent variables, rather than cortical connectivity, can be used as a more reliable feature to accurately identify subjects from a large group, even if only a short period of data is available per subject.
Introduction
The brain is active even at rest, showing complex activity patterns measurable with resting state fMRI (rs-fMRI)1. It is widely recognized that rs-fMRI activity is shaped by how the brain is wired, or the brain connectome2. Inter-regional correlations of rs-fMRI activity are often used to report functional connectivity3 and map brain networks for individuals4 or populations in various behavioral5 or disease states6. However, it remains largely unclear where rs-fMRI activity comes from7, 8, whereas understanding the underlying origins is critical to interpretation of any rs-fMRI pattern or dynamics9.
Prior findings suggest a multitude of sources (or causes) for rs-fMRI activity10, including but not limited to fluctuations in neurophysiology11, arousal12, unconstrained cognition13, non-neuronal physiology14, and head motion15. These sources only partially account for rs-fMRI activity and may be entangled not only among themselves but also with other sources that are left out simply because they are hard to specify or probe in the task-free resting state7. An inclusive study would benefit from using a data-driven approach to uncover and disentangle all plausible but hidden sources from rs-fMRI data itself, without restricting the sources to whatever is accessible to empirical observation. To be effective, such an approach should be able to infer sources from rs-fMRI data and generate new rs-fMRI data from sources, while being able to account for complex and nonlinear relationships between the sources and the data.
These requirements lead us to deep learning, or representation learning with deep neural networks16. In addition to its success in artificial intelligence, deep learning has also been increasingly applied to brain research17. Despite its great potential18–20, deep learning applied to resting state fMRI analysis has arguably made limited progress relative to what is attainable with conventional and simpler methods21. One challenge is inherent to the absence of any task in the resting state, as well as the lack of sufficient knowledge usable for training deep neural networks with supervised learning.
To mitigate this challenge, we chose to use a variational auto-encoder (VAE)22, 23, a type of deep learning model, for unsupervised learning from the ever-increasing “big data” in rs-fMRI. Briefly, we designed and trained a VAE model to represent rs-fMRI data in terms of its hidden (or latent) sources and tested its ability to explain and generate rs-fMRI data. We also explored the functional organization of rs-fMRI data in the latent space to reveal network interactions in the brain. Lastly, we tested the utility of this model for identifying individuals from their rs-fMRI data4, as a starting example of its applications.
Results
VAE compressed rs-fMRI maps
Inspired by its success in artificial intelligence22, 23, we designed a VAE model in order to disentangle the generative factors underlying rs-fMRI activity. The model used a pair of convolutional and deconvolutional neural networks in an encoder-decoder architecture (Figure 1.b). The encoder transformed any rs-fMRI pattern, formatted as an image on a regular 2D grid (Figure 1.a), to the probability distributions of 256 independent latent variables. The decoder used samples of the latent variables to reconstruct or generate an fMRI map. Using data from HCP (WU-Minn HCP Quarter 2)24, we first trained the model with rs-fMRI maps from 100 subjects and then tested it with rs-fMRI data from 500 other subjects.
After being trained, the model could compress any fMRI map to a low-dimensional latent space and restore the map from the latent representation separately for every time point (Figure 1.c). Such compression resulted in spatial blurring comparable to the effect of spatial smoothing with 4 mm full width at half maximum or the effect of linear dimension reduction with principal component analysis (Supplementary Figure 1). As such, the latent representation obtained with VAE preserved the spatiotemporal characteristics of rs-fMRI, with a modest but acceptable loss in spatial resolution and specificity.
VAE synthesized correlated fMRI activity
We asked whether the decoder in the VAE, as a generative model, could have learned the putative mechanisms by which rs-fMRI activity patterns arise presumably from brain networks. To address this question, we randomly sampled every latent variable from a standard normal distribution and used the decoder to synthesize 12,000 rs-fMRI maps. We calculated the seed-based correlations3 by using the VAE-synthesized data and compared the results with those obtained with length-matched rs-fMRI data concatenated across 10 subjects. Figure 2 shows three examples with the seed region in the primary visual cortex (V1), intraparietal sulcus (IPS), or posterior cingulate cortex (PCC). Both the synthesized and measured data gave rise to similar network patterns (mean±std of z-transformed spatial correlation z = 0.81±0.08, 0.97±0.07, or 0.88±0.05), consistent with early visual network, dorsal attention network, and default mode network reported in prior studies (e.g. by Yeo et al.25). Thus, the VAE provided a computational account for the generative process of resting state activity and could synthesize realistic rs-fMRI activity patterns and preserve inter-regional correlations as are observable in experiments.
Clusters in latent space
We further explored the utility of VAE for data-driven discovery of brain networks. We used the VAE to encode the rs-fMRI pattern observed at every time point from 500 subjects, clustered the time points by applying k-means clustering (k=21) to the low-dimensional latent representations, and decoded the cluster centroids to corresponding cortical maps. Each of the resulting maps represented a characteristic pattern of network interaction (see all 21 maps in Supplementary Figure 2).
Among the 21 clusters, 5 clusters (Cluster 5, 6, 8, 16, 19) showed activity increase (positive) at one or multiple regions in the default mode network26–28, alongside activity decrease (negative) at other regions (Figure 3.a). Both the positive and negative regions showed a varying degree of overlap across the 5 clusters. The overlapping positivity highlighted the default mode network and revealed sub-divisions of its constituent regions29. The overlapping negativity showed the networks presumably involved in attention30 and cognitive or executive control31–33. Similarly, we found 5 clusters with activity increase in the so-called frontoparietal control network31 (Cluster 10), cingulo-opercular network33 (Cluster 4 and 14), cognitive control network32 (Cluster 17), and dorsal attention network34 (Cluster 1), collectively referred to as “the task positive network”35 hereafter (Figure 3.b). These 5 clusters partially overlapped with respect to their positive regions but varied from one another with respect to their negative regions, while some of them showed little or no activity decrease. The overlapping positivity and negativity showed strong co-activation of the task positive network alongside weak deactivation of the default mode network. These results indicate patterns of opposition between the default mode network and the task positive network, conceptually similar to the notion of “anti-correlation”35. Interestingly, the opposition was asymmetric, being more pronounced when activity increased in the default mode network, but much weaker when activity increased in the task positive network.
In addition, the other clusters were also informative (Supplementary Figure 2). To name a few examples, Cluster 21 showed activity decrease in the whole brain and was thereby a signature of global signal fluctuation. Cluster 13 and 15 showed widespread synchrony across sensory systems. Cluster 7 and 9 showed the networks for sensorimotor control of the limbs and of the mouth, pharynx, and visceral organs, respectively. Whereas most clusters were bilaterally symmetric, Cluster 2 and 20 were unilateral to the right and left prefrontal cortex, respectively. Common to many clusters was the fact that a cluster could highlight the positive interactions among a set of well-defined cortical regions alongside their negative interactions with a different set of regions. These results demonstrate that VAE enables data-driven discovery of overlapping and interacting networks for functional integration, as opposed to networks that limit themselves to anatomical and functional segregation.
Individual identification
We further asked whether functional connectivity (FC) in the latent space could be used as a feature or “fingerprint” for identifying individuals in a population4, 36. We calculated the correlation between every pair of latent variables, assembled the pair-wise FC into an FC profile, and evaluated its similarity between two separate sessions within or between subjects. For comparison, we performed similar analyses by evaluating FC between 360 cortical areas in an existing atlas37. As shown in Figure 4.a, FC between any pair of cortical areas was mostly positive (mean ± std of z-transformed correlation: z=0.26±0.3) and highly reproducible not only within the same subject (r=0.66) but also between different subjects (r=0.45). On the other hand, FC between latent variables had both positive and negative values (z=0.00±0.14) and was reproducible within the same subject (r=0.32) but not between different subjects (r=0.08). Although less reproducible, the FC profile was more distinctive across subjects when it was evaluated between latent variables rather than cortical areas (Figure 4.b). In the latent space, the FC profile was significantly more consistent within a subject than between subjects (two-sample t-test, t(249,998)=235.81, two-sided p<0.001). The distribution of within-subject correlations was almost completely separated from that of between-subject correlations (Figure 4.b, bottom).
Then we compared the performance of individual identification on the basis of the FC profile in the latent vs. cortical space. To identify 1 out of 500 subjects, we compared a target subject’s FC profile in the 1st session with every subject’s FC profile in the 2nd session and chose the best match in terms of Pearson correlation coefficient. As such, the choice was correct if the correlation with the target subject was higher than the largest correlation with any non-target subject. We found that the FC profile in the cortical space could support 69.3% top-1 accuracy while identification was often done with marginal confidence relative to the decision boundary (Figure 4.c). Using the FC in the latent space allowed us to reach 97.5% top-1 accuracy. The evidence for correct identification was apparent with a large margin from the decision boundary (Figure 4.d). Moreover, the use of FC in the latent space supported reliable and robust performance in top-1 identification given an increasingly larger population (Figure 4.e) or when the data were limited to a short duration (Figure 4.f), being notably superior to the use of FC in the cortical space.
Discussion
Here, we present a method for unsupervised representation learning of cortical rs-fMRI activity. Our results suggest that this method is able to disentangle generative factors underlying spontaneous brain activity, discover overlapping brain networks with opposing or associated functions, and capture individual characteristics or variation. We expect this method to be a valuable addition to the existing tools for investigating the origins of resting state activity, mapping functional brain networks, and potentially supporting individualized prediction of disease phenotypes and progression. Next, we discuss our findings from the joint perspective of methodology, neuroscience, and applications.
VAE is trainable with unsupervised learning22, 23 (without any label), which is appealing for learning representations of rs-fMRI data. Since rs-fMRI measures spontaneous brain activity unconstrained by any task, labels as required for supervised learning are either unavailable or far fewer than the data itself. Unsupervised learning with VAE can leverage the ever-increasing amount of rs-fMRI data24. The latent representations extracted from VAE can serve as the input to other algorithms to further support more specific goals such as classification of brain disorders and prediction of their phenotypes38, 39.
The method herein can be extended in multiple ways. Although it is trained with rs-fMRI data, we hypothesize that the VAE model can encode and decode both rs-fMRI and task-fMRI data but with different latent distributions. If this is true, one may use this model to classify different perceptual, behavioral, or cognitive states and to reveal the distinctive network interactions underlying various states40. The fact that the VAE can synthesize new data (Figure 2) is also appealing. It can be used as a post-processing strategy for data augmentation and interpolation, when data is short or corrupted, of interest for evaluation of dynamic functional connectivity41, 42 and correction of head motion15. It also supports the notion that the learned latent space captures the origins of rs-fMRI and the VAE decoder captures the computational account for how rs-fMRI arises from its origins.
It is worth mentioning two limitations of the VAE model in its current form. First, the model focuses on cortical patterns but excludes sub-cortical and white-matter voxels. This design was chosen not only for ease of model implementation but also because of the predominant role of the neocortex in brain functions43. However, it precludes the model from accounting for subcortical networks or their interactions with the cortex. Addressing this limitation awaits future studies to redesign the model as a 3-D neural network that takes volumetric fMRI data as the input. Second, the VAE model only represents spatial patterns but ignores temporal dynamics inherent to rs-fMRI data. Modeling the temporal dynamics is desirable but non-trivial, since they are highly irregular, complex and variable. To fill this gap, we direct future studies to designing a recurrent neural network19, 44, as an add-on to VAE, for sequence learning based on spatial representations extracted from individual time points.
VAE provides a new tool for mapping overlapping functional networks in the brain. A brain region may be involved in multiple networks each supporting a distinctive function45, 46. However, existing network analyses still tend to group brain regions into non-overlapping networks25. VAE allows us to discover overlapping networks as clusters in the latent space spanned by independent latent variables. As such, VAE is conceptually similar to temporal ICA45 but allows for nonlinear relationships between latent variables and the input data they represent47. Arguably, finding clusters in the low-dimensional latent space is more desirable than doing so in the higher-dimensional voxel space48. Not only is it more computationally efficient, but data representations are also more disentangled in the latent space than in the voxel space to readily reveal the underlying organization, as discussed later.
Clusters in the latent space do not manifest themselves as resting state networks25 per se but highlight interactions among those networks. Many of the clusters cover more regions and/or reveal finer divisions within regions than are commonly observed in resting state networks (Figure 3). In each cluster, the interactions among its constituent regions should not be interpreted pairwise (e.g. correlation) but as two multivariate modes: co-activation and co-deactivation, which we interpret as the signatures of functional association and opposition, respectively.
Our results suggest functional opposition between regions in the default mode network and those in cognitive control networks. This finding agrees with the prior finding that attention-demanding tasks tend to increase activity in cognitive control networks (also referred to as the task positive network35) and decrease activity in the default mode network26. It is reminiscent of the anti-correlation between the task positive network and the default mode network35. However, the anti-correlation is controversial and confounded by global signal regression49, a questionable preprocessing step that causes spuriously negative correlations50. Note that global signal regression was not used and is therefore not of concern in this study. Our finding provides complementary evidence, supporting a similar but revised view of anti-correlation35. We conclude that the functional opposition between the default mode network and the task positive network is indeed real but non-stationary41, 46. It occurs at some but not all times. It is also asymmetric in that activity increase in the default mode network tends to co-occur with activity decrease in the task positive network, whereas activity increase in the task positive network does not necessarily co-occur, or co-occurs less frequently, with activity decrease in the default mode network. Interestingly, the global signal fluctuation is also non-stationary and identifiable as a different cluster in the latent space. Together, the functional opposition and the global signal are separable in time; therefore, the latter does not necessarily invalidate or confound the former.
Central to this study is the efficacy of using VAE to disentangle what causes resting state activity. In the VAE model, the sources are the latent variables; the decoder describes how the sources generate the observed activity; the encoder models the inverse inference of the sources from the activity. Since the latent variables are data-driven, it is currently unclear how to interpret them as specific physiological processes, many of which are not observable. Nevertheless, we expect the latent variables extracted by VAE to provide the computational basis for further understanding the origins of resting state activity. We hypothesize that the truly disentangled physiological origins, whether observable or not, are individually describable as the latent variables up to linear and sparse projection. This hypothesis awaits confirmation by future studies.
In the latent space, functional connectivity describes the correlations among the disentangled sources of resting state activity. This is a new perspective different from the functional connectivity among observable voxels, regions or networks3, 25. If the VAE model has fully disentangled the sources at the population level, functional connectivity should be near zero between different latent variables. In other words, the model sets a nearly null baseline such that the latent-space functional connectivity primarily reflects features unique to individuals. Supporting this notion, our results suggest that the use of functional connectivity in the latent space leads to significantly improved accuracy, robustness, and efficiency in individual identification, compared to the use of functional connectivity among cortical parcels4, 36. Note that our main purpose is not to push for a higher identification accuracy but to understand the distribution and geometry of data representations in the feature space. Therefore, we opt for minimal preprocessing and the simplest strategy for individual identification. There is room for methodological development to further improve the identification accuracy or to extend it to many other tasks, including classification of gender or disease states and prediction of behavioral and cognitive performance, to name a few examples. We expect that such applications would be fruitful and potentially impactful for cognitive sciences and clinical practice.
Methods
Data
We used rs-fMRI data from 602 healthy subjects randomly chosen from the Q2 release by HCP24. For each subject, we used two sessions of rs-fMRI data acquired on different days with either right-to-left or left-to-right phase encoding. Each session included 1,200 time points separated by 0.72 s. Following minimal preprocessing51, we applied voxel-wise detrending (regressing out a 3rd-order polynomial function), bandpass filtering (from 0.01 to 0.1 Hz), and normalization (to zero mean and unit variance). We further separated the data into three sets, including 100, 2, or 500 subjects for training, validating, or testing the VAE model, respectively.
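As a rough illustration of two of the preprocessing steps named above, the sketch below removes a 3rd-order polynomial trend from a single time series and then normalizes it to zero mean and unit variance. It is a minimal standard-library sketch, not the authors' pipeline: the function name detrend_and_normalize is hypothetical, the polynomial fit is done by Gram-Schmidt orthogonalization as one possible choice, and the bandpass filtering step is deliberately omitted.

```python
import math

def detrend_and_normalize(signal, order=3):
    """Remove a polynomial trend of the given order, then z-score the residual."""
    n = len(signal)
    # Time axis rescaled to [-1, 1] to keep the polynomial terms well conditioned.
    t = [2.0 * i / (n - 1) - 1.0 for i in range(n)]
    # Orthonormal polynomial basis built by Gram-Schmidt on 1, t, t^2, ...
    basis = []
    for p in range(order + 1):
        v = [ti ** p for ti in t]
        for b in basis:
            proj = sum(vi * bi for vi, bi in zip(v, b))
            v = [vi - proj * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        basis.append([vi / norm for vi in v])
    # Subtracting the projection onto each basis vector is a least-squares fit.
    residual = list(signal)
    for b in basis:
        coef = sum(ri * bi for ri, bi in zip(residual, b))
        residual = [ri - coef * bi for ri, bi in zip(residual, b)]
    # Normalize to zero mean and unit variance.
    mean = sum(residual) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in residual) / n)
    return [(r - mean) / std for r in residual]
```

In practice this would be applied independently to every voxel (or cortical location), with a bandpass filter inserted between the two steps.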
Geometric reformatting
We converted the rs-fMRI data from 3-D cortical surfaces to 2-D grids in order to structure the rs-fMRI pattern as an image and thereby ease the application of convolutional neural networks. As illustrated in Figure 1.a, we inflated each hemisphere to a sphere by using FreeSurfer52. For each location on the spherical surface, we used cart2sph.m in MATLAB to convert its Cartesian coordinates (x, y, z) to spherical coordinates (a, e) reporting the azimuth and elevation angles in a range from −π to π and from −π/2 to π/2, respectively. We defined a 192×192 grid to resample the spherical surface with respect to azimuth and sin(elevation) such that the sampled locations were approximately uniformly distributed (Supplementary Figure 3). We used nearest-neighbor interpolation to convert data from the 3-D surface to the 2-D grid, and vice versa.
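The coordinate mapping described above can be sketched as follows. The code mirrors the azimuth and elevation convention of MATLAB's cart2sph and bins each point by azimuth and sin(elevation). The function grid_index, the row and column orientation, and the binning by truncation are assumptions made for illustration; the original nearest-neighbor resampling may differ in detail.

```python
import math

def cart2sph(x, y, z):
    """Return (azimuth, elevation) in radians, following MATLAB's cart2sph convention."""
    azimuth = math.atan2(y, x)                    # in [-pi, pi]
    elevation = math.atan2(z, math.hypot(x, y))   # in [-pi/2, pi/2]
    return azimuth, elevation

def grid_index(x, y, z, size=192):
    """Map a point on the sphere to (row, col) on a size x size grid.

    Columns sample azimuth uniformly; rows sample sin(elevation) uniformly,
    so that grid cells cover roughly equal areas on the sphere.
    """
    azimuth, elevation = cart2sph(x, y, z)
    u = (azimuth + math.pi) / (2.0 * math.pi)     # 0..1 along azimuth
    v = (math.sin(elevation) + 1.0) / 2.0         # 0..1 along sin(elevation)
    col = min(int(u * size), size - 1)            # clamp the upper edge into the grid
    row = min(int(v * size), size - 1)
    return row, col
```

Sampling sin(elevation) rather than elevation itself avoids oversampling the poles, which is why the sampled locations end up approximately uniform on the sphere.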
Variational autoencoder
We designed a β-VAE model23, a variation of VAE22, to learn representations of rs-fMRI spatial patterns. This model included an encoder and a decoder (Figure 1.b). The encoder converted an fMRI map to a probabilistic distribution of 256 latent variables. The decoder sampled the latent distribution to reconstruct the input fMRI map or generate a new map. The encoder stacked five convolutional layers and one fully connected layer. Every convolutional layer applied linear convolution and rectified its output53. The 1st layer applied 8×8 convolution separately to the input from each hemisphere and concatenated its output. The 2nd through 5th layers applied 4×4 convolution. The fully connected layer applied linear weighting and yielded the mean and standard deviation that described the normal distribution of each latent variable. The decoder used nearly the same architecture as the encoder but connected the layers in the reverse order for transformation from the latent space to the input space. See Figure 1.b for more details about the architecture.
We trained the VAE model to reconstruct its input while constraining the distribution of every latent variable to be close to an independent and standard normal distribution. Specifically, using the training data, we optimized the encoding parameters, ϕ, and the decoding parameters, θ, to minimize the loss function below.

$$\mathcal{L}(\phi, \theta \mid x) = \lVert x - x' \rVert_2^2 + \beta \cdot D_{KL}\left[\mathcal{N}(\mu_z, \sigma_z) \,\Vert\, \mathcal{N}(0, I)\right]$$

where x is the input data combined across the left and right hemispheres, x′ is the corresponding output from the model, $\mathcal{N}(\mu_z, \sigma_z)$ is the posterior normal distribution of the latent variables, z, with their mean and standard deviation denoted as μz and σz, $\mathcal{N}(0, I)$ is an independent and standard normal distribution as the prior distribution of the latent variables, DKL measures the Kullback-Leibler divergence between the posterior and prior distributions, and β is the hyperparameter balancing the two terms in the loss function. We optimized the model by using stochastic gradient descent (batch size=128, learning rate=$10^{-5}$, and 500 epochs) and the Adam optimizer54 implemented in PyTorch (v1.2.0). We explored four values (1, 2, 5, 10) for β and chose β = 5 to disentangle the latent variables while minimizing the loss function in training and validation (Supplementary Figure 4).
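To make the two terms of the loss concrete, the sketch below evaluates a β-weighted objective for a single sample. It assumes that the reconstruction term is a sum of squared errors and that the posterior is a diagonal Gaussian, for which the Kullback-Leibler divergence to a standard normal has the closed form 0.5·(μ² + σ² − 1) − ln σ per latent variable. The function names are hypothetical, and the code is written with the standard library rather than PyTorch, so it illustrates the arithmetic only, not the training procedure.

```python
import math
import random

def reparameterize(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), one sample per latent variable."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def kl_to_standard_normal(mu, sigma):
    """KL divergence from N(mu, diag(sigma^2)) to N(0, I), summed over latent variables."""
    return sum(0.5 * (m * m + s * s - 1.0) - math.log(s) for m, s in zip(mu, sigma))

def beta_vae_loss(x, x_recon, mu, sigma, beta=5.0):
    """Squared reconstruction error plus beta times the KL term."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_recon))
    return recon + beta * kl_to_standard_normal(mu, sigma)
```

A larger β penalizes departures of the posterior from the independent standard normal prior more heavily, which encourages disentangled latent variables at the cost of reconstruction accuracy.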
Synthesizing rs-fMRI functional connectivity
We used the trained VAE to synthesize rs-fMRI data from random samples of latent variables. To synthesize a vector in the latent space, we drew a random sample of every latent variable independently from a standard normal distribution. The synthesized vector passed through the decoder in VAE, generating a cortical pattern. Repeating this process, we synthesized 12,000 cortical patterns as data used for seed-based correlation analysis. As examples, we explored three seed locations within V1, IPS, and PCC and calculated the functional connectivity to each seed based on the Pearson correlation coefficient. The MNI coordinates of the seed in V1, IPS, and PCC were (7, −83, 2), (26, −66, 48), and (0, −57, 27), respectively55. For comparison, we evaluated seed-based correlations with length-matched experimental rs-fMRI data concatenated across 10 subjects in HCP. We evaluated the reproducibility of the results by repeating the above analysis 20 times with different synthesized data and the experimental data from different subsets of subjects.
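The synthesis and seed-based correlation steps can be sketched as below. Because the trained decoder is not available here, toy_decoder is a hypothetical stand-in: a fixed linear map from 2 latent variables to 3 cortical locations. The helper names and the number of maps are likewise illustrative assumptions.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def synthesize(decoder, n_latent, n_maps, rng):
    """Decode n_maps latent vectors drawn from independent standard normals."""
    return [decoder([rng.gauss(0.0, 1.0) for _ in range(n_latent)])
            for _ in range(n_maps)]

def seed_correlation(maps, seed):
    """Correlate the seed location with every location across the synthesized maps."""
    seed_series = [m[seed] for m in maps]
    return [pearson(seed_series, [m[j] for m in maps]) for j in range(len(maps[0]))]

# Hypothetical stand-in for the trained decoder (illustration only).
def toy_decoder(z):
    return [z[0], z[0] + 0.1 * z[1], -z[0] + 0.1 * z[1]]

rng = random.Random(0)
maps = synthesize(toy_decoder, n_latent=2, n_maps=2000, rng=rng)
fc = seed_correlation(maps, seed=0)
```

With the toy decoder, the second location is strongly positively correlated with the seed and the third strongly negatively, which shows how inter-regional correlations arise purely from the decoder even though the latent samples are independent.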
Clustering in the latent space
We encoded the rs-fMRI spatial pattern at every time point for 500 testing subjects, yielding 600,000 vectors in the latent space. We used k-means clustering (with Euclidean distance) to group those vectors into 21 clusters. The choice of k=21 was made empirically, in part to be consistent with a prior study with a similar motivation45 and in part to fall within the range of the number of resting state networks reported in the literature. For each of the 21 clusters, the cluster centroid was calculated and converted to a corresponding cortical pattern by using the VAE’s decoder; the resulting cortical pattern was scaled such that its maximal absolute value equaled 1.
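A minimal sketch of the clustering and scaling steps is given below. It uses plain Lloyd's algorithm with squared Euclidean distance and random initialization, which is an assumption: the text does not state which k-means implementation or initialization was used. The function names are hypothetical, and the decoding of centroids by the VAE is omitted.

```python
import random

def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm with squared Euclidean distance."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid for each point.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

def scale_to_unit_peak(pattern):
    """Scale a decoded pattern so that its maximal absolute value equals 1."""
    peak = max(abs(v) for v in pattern)
    return [v / peak for v in pattern]
```

For the latent vectors described above, an optimized library implementation would be the practical choice; the pure-Python loop only makes the assignment and update steps explicit.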
To evaluate the spatial overlap among clusters, we thresholded the cortical pattern resulting from each cluster at >0.35 (for positivity) or <−0.35 (for negativity). For clusters relevant to the default mode network (5, 19, 8, 6, 16) or the task positive network (17, 1, 14, 4, 10), we calculated the overlapping positivity (or negativity) by counting, for each cortical location, the number of clusters in which its value was above 0.35 (or below −0.35).
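The overlap count described above amounts to a per-location tally across the selected cluster patterns, as in this short sketch. The function name overlap_counts is hypothetical, and each pattern is assumed to be already scaled to a maximal absolute value of 1.

```python
def overlap_counts(patterns, threshold=0.35):
    """Count, per cortical location, how many patterns exceed +threshold
    (positivity) or fall below -threshold (negativity)."""
    n_loc = len(patterns[0])
    positive = [sum(1 for p in patterns if p[j] > threshold) for j in range(n_loc)]
    negative = [sum(1 for p in patterns if p[j] < -threshold) for j in range(n_loc)]
    return positive, negative
```

With five cluster patterns as input, each location receives a count between 0 and 5 for positivity and for negativity.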
Individual identification
In the testing data set, every individual had rs-fMRI data acquired for two separate sessions. For each session, we encoded the data as (256×1,200) latent representations, calculated the z-transformed correlation between every pair of latent variables, and stored the z-values into a vector, referred to as the FC profile in the latent space.
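The FC profile can be sketched as follows: compute the Pearson correlation between every pair of latent time series, apply the Fisher z-transform (the inverse hyperbolic tangent), and stack the results into a vector. Keeping only the unique pairs is an assumption of this sketch; under that assumption, 256 latent variables give 256×255/2 = 32,640 entries. The function names are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def fc_profile(latent):
    """latent: one time series per latent variable.

    Returns the Fisher z-transformed correlation for every unique pair,
    ordered as (0,1), (0,2), ..., (n-2,n-1).
    """
    n = len(latent)
    return [math.atanh(pearson(latent[i], latent[j]))
            for i in range(n) for j in range(i + 1, n)]
```

Note that math.atanh is undefined at exactly ±1, so perfectly correlated series would need to be clipped before the transform.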
We tested the utility of this FC profile as the feature for identifying individuals in a population (n=500). For every subject, we used the FC profile collected in one session as the subject-identifying key in a database. Given this database, we tested the accuracy of retrieving any subject’s identity by using a query based on the subject’s FC profile in the other session. To retrieve the identity, we compared the query to every key to find the best match in terms of the highest correlation. We evaluated the identification accuracy as the percentage of queries for which the correct identity was retrieved. Since we could use either session 1 or session 2 for the key while using the other for the query, we tested both cases and averaged the identification accuracy.
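The key and query matching described above can be sketched as a nearest-neighbor lookup under Pearson correlation. The function names identify and two_way_accuracy are hypothetical, and subjects are assumed to appear in the same order in both sessions, so that a match is correct when the best key has the same index as the query.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def identify(keys, queries):
    """Fraction of queries whose best-matching key (highest r) has the same index."""
    correct = 0
    for i, q in enumerate(queries):
        best = max(range(len(keys)), key=lambda k: pearson(q, keys[k]))
        correct += int(best == i)
    return correct / len(queries)

def two_way_accuracy(session1, session2):
    """Average accuracy over both key/query assignments of the two sessions."""
    return 0.5 * (identify(session1, session2) + identify(session2, session1))
```

The returned value is a fraction between 0 and 1; multiplying by 100 gives the top-1 accuracy as a percentage.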
For comparison, we also evaluated the functional connectivity between every pair of 360 cortical parcels defined in an established atlas37. Similarly, we used the FC profile in the cortical space as the feature for individual identification and compared the resulting identification accuracy with that based on the FC profile in the latent space. We repeated this comparative evaluation with a varying population size (from n=5 to 500) or a varying length of data (from 9 to 180 s). We repeated the above analysis 100 times, each time with a different subset of the testing data and averaged the identification accuracy across the repeated tests.