## Abstract

Technological advances have enabled the joint analysis of multiple molecular layers at single cell resolution. At the same time, increased experimental throughput has facilitated the study of larger numbers of experimental conditions. While methods for analysing single-cell data that model the resulting structure of either of these dimensions are beginning to emerge, current methods do not account for complex experimental designs that include both multiple views (modalities or assays) and groups (conditions or experiments). Here we present Multi-Omics Factor Analysis v2 (MOFA+), a statistical framework for the comprehensive and scalable integration of structured single cell multi-modal data. MOFA+ builds upon a Bayesian Factor Analysis framework combined with fast GPU-accelerated stochastic variational inference. Similar to existing factor models, MOFA+ allows for interpreting variation in single-cell datasets by pooling information across cells and features to reconstruct a low-dimensional representation of the data. Uniquely, the model supports flexible group-level sparsity constraints that allow joint modelling of variation across multiple groups and views.

To illustrate MOFA+, we applied it to single-cell data sets of different scales and designs, demonstrating practical advantages when analyzing datasets with complex group and/or view structure. In a multi-omics analysis of mouse gastrulation this joint modelling reveals coordinated changes between gene expression and epigenetic variation associated with cell fate commitment.

## Introduction

Single-cell methods have provided unprecedented opportunities to assay cellular heterogeneity. This is particularly important for studying complex biological processes, including the immune system, embryonic development and cancer^{1–4}.

Following the establishment of the first scalable single-cell RNA sequencing (scRNA-seq) methods, other molecular layers are increasingly receiving attention, including single-cell assays for DNA methylation^{5–9} and chromatin accessibility^{10–12}. More recently, technological advances enabled multiple biological layers to be probed in parallel in the same cells^{12,13}, including: single-cell genome and transcriptome (G&T-seq)^{14}, single-cell DNA methylation and transcriptome (scM&T-seq)^{15}, single-cell chromatin accessibility and transcriptome (sci-CAR)^{16} and single-cell Nucleosome, Transcriptome and Methylation (scNMT-seq)^{17}, among others^{18–24}. These experimental techniques provide the basis for studying regulatory dependencies between transcriptomic and (epi)-genetic diversity at the single-cell level.

However, from a computational perspective, the integration of single-cell assays remains challenging owing to high degrees of missing data, inherent assay noise, and the scale of modern single-cell datasets, which can potentially span millions of cells. Previously, we introduced Multi-Omics Factor Analysis (MOFA), a statistical framework that addresses some of these challenges. However, the inference framework of MOFA is not designed to cope with increasingly large-scale datasets. Moreover, while MOFA is already devised to account for multiple views, the model makes strong assumptions about the dependencies across cells and in particular cannot account for additional structure between cells, e.g. batches, donors or conditions. By pooling and contrasting information across studies or experimental conditions, it would be possible to obtain more comprehensive insights into the complexity underlying biological systems^{25–29}.

Other methods that have been proposed for integrating different data modalities include Seurat (v3) and LIGER, two strategies based on dimensionality reduction and manifold alignment^{30,31}. Both methods anchor independent datasets from related populations of cells by exploiting the existence of a common feature space (for example matching gene expression and corresponding promoter accessibility). MOFA+, in contrast, is aimed at a different problem and allows for integrating data modalities via a common sample space (i.e. measurements derived from the same set of cells), where the features may be distinct across views.

## Model description

In a previous study, we developed Multi-Omics Factor Analysis (MOFA), a statistical framework for the integrative analysis of multiple data modalities^{32}. Using a Bayesian Group Factor Analysis framework, MOFA infers a low-dimensional representation of the data in terms of a small number of (latent) factors that capture the global sources of variability (**Figure 1a**). While the model is applicable to single-cell assays, MOFA and related factor models have critical limitations, including their scalability and simplistic assumptions on the structure of the data. In particular, these models do not provide a principled approach for integrating multiple groups and views within the same inference framework.

In MOFA+ we address these challenges by i) developing a stochastic variational inference framework amenable to GPU computations, enabling inference with potentially millions of cells and ii) incorporating priors for flexible, structured regularisation, thus enabling joint modelling of multiple groups of samples and multiple views.

Briefly, the inputs to MOFA+ are multiple datasets where features have been aggregated into non-overlapping sets of views and where cells have been aggregated into non-overlapping sets of groups (**Figure 1a**). Views generally correspond to different data modalities or omics (e.g. RNA expression, DNA methylation and chromatin accessibility), and groups to different experiments, batches or conditions. Importantly, the probabilistic framework underlying MOFA+ naturally handles missing values. During model training, MOFA+ infers K latent factors (per group) with associated feature weight matrices (per view) that explain the major axes of variation across the datasets. The model uses sparsity-inducing priors to account for the structure of the data and to encourage sparse, interpretable solutions. After training, the model output enables a wide range of downstream analyses (**Figure 1b**), including variance decomposition, inspection of feature weights, inference of differentiation trajectories, and clustering, among others.

For technical details and mathematical derivations, we refer the reader to the **Methods** and the **Appendix.** A comparison of the model features with other factor analysis models is provided in **Table S1**.

## Model validation using simulated data

First, we validated the new features of MOFA+ using simulated data drawn from its generative model. The simulated data represented a range of dataset sizes with differing numbers of views and groups.

To assess the utility of stochastic variational inference, we trained different models using either conventional (deterministic) variational inference (VI) or stochastic variational inference (SVI). Across a wide range of hyperparameter settings (see **Methods**) we observed that SVI yields Evidence Lower Bounds (i.e., values of the objective function) that are consistent with those obtained from conventional inference (**Figure S1**). Moreover, the GPU-accelerated SVI implementation in MOFA+ achieved up to a ~20-fold increase in speed compared to VI, with the most dramatic speedups observed for large datasets. This inference scheme facilitates the application of MOFA+ to datasets comprising hundreds of thousands of cells using commodity hardware (**Figure S2**).

Next, we evaluated the group-wise sparsity prior by testing to what extent it facilitates the identification of factors with simultaneous differential activity between groups and views. Indeed, when simulating data where factors explain differing amounts of variance across groups and across views, MOFA+ was able to reconstruct the true factor activity patterns more accurately than MOFA v1 or standard Bayesian Factor Analysis (**Figure S3**).

## Integration of a heterogeneous time-course single-cell RNA-seq dataset

To illustrate the ability of MOFA+ to model multiple groups, we considered a time course scRNA-seq dataset (a single view), consisting of 16,152 cells that were isolated from multiple mouse embryos at embryonic days (E) 6.5, E7.0 and E7.25 (two biological replicates per stage). In this dataset, embryos are expected to contain similar subpopulations of cells but also to exhibit transcriptional differences due to variation in the rate of the developmental progression. As a proof of principle, we used MOFA+ to disentangle stage-specific variation from variation that is shared across all samples. For this purpose, we considered the six batches of cells (two replicates for each of the three embryonic stages) as different groups in the MOFA+ model.

MOFA+ identified 10 robust factors (Methods, **Figure S4**), capturing between 35% and 55% of the total transcriptional variance per embryo (**Figure 2a**). Some factors recapitulate the existence of post-implantation developmental cell types, including extra-embryonic (ExE) cell types (Factor 1 and Factor 2) and the transition of epiblast cells to nascent mesoderm via a primitive streak transcriptional state (Factor 4; **Figure 2b-c and Figure S5**). Consistently, the top weights for these factors are enriched for lineage-specific gene expression markers, including *Ttr* and *Apoa1* for ExE endoderm, *Rhox5* and *Bex3* for ExE ectoderm, and *Mesp1* and *Phlda2* for nascent mesoderm^{33}. Other factors correspond to the formation of mesoderm-derived subpopulations that emerge from the caudal primitive streak after E7.0, including mesenchymal cells (Factor 9, **Figure S6**). MOFA+ also detects technical variation due to metabolic stress that affects all batches in a similar fashion (Factor 3, **Figure S7**).

When inspecting the factor activity across development, we observe that the percentage of variance explained by Factor 1 is not correlated with developmental progression, indicating that commitment to ExE endoderm fate occurs early in the embryo and that the proportion of this cell type remains relatively constant from E6.5 to E7.25. In contrast, the amount of variance explained by Factor 4 increases over time (**Figure 2d**), consistent with a higher proportion of cells committing to mesoderm after ingression through the primitive streak.

Altogether, this application shows how MOFA+ can identify biologically meaningful structure in scRNA-seq datasets with multiple groups. Interpretability is achieved at the expense of limited information content per factor (due to the linearity assumption). Nevertheless, the MOFA+ factors can be used as input to other methods that learn compact nonlinear manifolds that discriminate cell types (**Figure 2e**) and enable the reconstruction of pseudotime trajectories^{34,35}.

## Identification of context-dependent methylation signatures associated with cellular diversity in the mammalian cortex

As a second use case, we considered how MOFA+ can be used to investigate variation in epigenetic signatures between populations of neurons. This application illustrates how a multi-group and multi-view structure can be defined from seemingly uni-modal data, which can then be used to test specific biological hypotheses.

We analysed 3,069 cells isolated from the frontal cortex of young adult mice, where DNA methylation was profiled using single-cell bisulfite sequencing^{7}. Recent studies have demonstrated that neurons contain significant levels of non-CpG methylation (mCH), an epigenetic mark that has been historically dismissed as a methodological artifact of incomplete bisulfite conversion^{36–39}.

Here we used MOFA+ to dissect the degree of coordination between mCH and mCG signatures in different regions of the brain. As input data we quantified mCH and mCG levels at gene bodies, promoters and putative enhancer elements (Methods). Each combination of genomic and sequence context (e.g., mCG at enhancer elements) was defined as a separate view. To explore the influence of the neuron’s location we grouped cells according to their cortical layer: Deep, Middle or Superficial (**Figure S8**).

The sparseness of DNA methylation data results in large amounts of missing values, which hampers the use of conventional dimensionality reduction techniques such as PCA or NMF^{34,35,40}. By contrast, the probabilistic framework underlying MOFA+ naturally accounts for missing values^{32}.

MOFA+ identifies 5 robust factors (Methods; **Figure S9**) that explain the structured heterogeneity across genomic contexts and cortical layers. Factor 1, the major source of variation, is linked to the existence of inhibitory and excitatory neurons. This factor shows significant mCG activity across all cortical layers, mostly driven by coordinated changes in enhancer elements, but to some extent also gene bodies (**Figure 3a,b**). Consistently, the top weights in the mCG:gene body view are enriched for genes whose RNA expression has been shown to discriminate between the two classes of neurons, including *Neurod6* and *Nrgn*^{7}. In addition, this analysis identifies novel genes with differential gene body mCG levels that may have yet unknown roles in defining the epigenetic landscape of neuronal diversity, including *Vsig2*, *Taar3* and *Cort* (**Figure S10**).

Factor 2 captures genome-wide differences in global mCH levels (R=0.99), which is moderately correlated with differences in global mCG levels (R=0.32) (**Figure S11**). Factor 3 captures heterogeneity linked to the increased cellular diversity along cortical depth, with the Deep layer displaying significantly more diversity of excitatory cell types than the Superficial layer (**Figure 3a,c**). Again, we observe that the MOFA+ factors are a suitable input to learn non-linear manifolds and reveal the existence of subpopulations of both excitatory and inhibitory cell types (**Figure 3d**). Notably, the MOFA+ factors are significantly better at identifying subpopulations than the conventional approach of using Principal Component Analysis with imputed measurements (**Figure S12**).

Interestingly, in addition to the dominant mCG signal, MOFA+ connects Factor 1 and Factor 3 to variation in mCH, which could suggest a role of mCH in cellular diversity. We reasoned that such a role would be supported if the genomic regions that show mCH signatures differ from those marked by the conventional mCG signatures. To investigate this, we correlated the mCH and mCG feature weights for each factor and genomic context (**Figure 3e and Figure S13**). In all cases we observe a strong positive dependency, indicating that mCH and mCG signatures are spatially correlated and target similar loci.

Taken together, this result supports the hypothesis that mCH and mCG tag the same genomic loci and are associated with the same sources of variation, suggesting that the presence of mCH may be the result of non-specific *de novo* methylation as a by-product of the establishment of mCG^{36}.

## MOFA+ reveals molecular signatures of lineage commitment during mammalian embryogenesis

As a final application, we considered a complex dataset with multiple sample groups and views. The dataset consists of a multi-omic atlas of mouse gastrulation where scNMT-seq was used to simultaneously profile RNA expression, DNA methylation and chromatin accessibility in 1,828 cells at multiple stages of development^{41}. In this dataset MOFA+ provides a method for delineating coordinated variation between the transcriptome and the epigenome and for detecting at which stage(s) of development it occurs.

As input to the model we quantified DNA methylation and chromatin accessibility values over two sets of regulatory elements: gene promoters and enhancer elements (distal H3K27ac sites^{41–43}). RNA expression was quantified over protein-coding genes. After data processing (**Methods**), separate views were defined for the RNA expression and for each combination of genomic context and epigenetic readout. Cells were grouped according to their developmental stage (E5.5, E6.5 and E7.5), reflecting the underlying experimental design (**Figure S14**). Notably, the epigenetic readouts are extremely sparse, with, on average, only 18% and 10% of cells having recorded data at a gene promoter for DNA methylation and chromatin accessibility, respectively. In this context, methods that pool information across cells and features are essential for robust inference.

MOFA+ identifies 8 robust factors with a minimum variance explained of 1% in the gene expression view. The first factor captured the formation of ExE endoderm, a cell type that is present across all stages (**Figure 4a**), in agreement with our previous results using the independently generated transcriptomic atlas of mouse gastrulation (**Figure 2**). MOFA+ links Factor 1 to changes across all molecular layers. Notably, the distribution of weights for DNA methylation is skewed towards negative values (at both enhancers and promoters), indicating that ExE endoderm cells are characterised by a state of global demethylation, consistent with previous studies^{44}.

The next factors captured the molecular variation associated with the emergence of the primary germ layers at E7.5: mesoderm (Factor 2, **Figure 4b**), and embryonic endoderm (Factor 4, **Figure S15**). Again, for both factors, MOFA+ connects the transcriptome variation to changes in DNA methylation and chromatin accessibility. Yet, in striking contrast to Factor 1, the variance decomposition analysis and the distribution of weights indicate that the epigenetic dynamics are mostly driven by enhancer elements. Little coordinated variation is observed in promoters (**Figure 4b**), even for genes that show strong differential expression between germ layers (**Figure S16**). These results are in agreement with other studies that pinpointed distal regulatory elements as a major target of epigenetic modifications during embryogenesis^{45–47}.

The remaining factors capture variation that is mostly driven by the RNA expression, whose etiology can be related to the existence of morphogenic gradients (Factor 8, **Figure S17**), the emergence of other cellular subpopulations during gastrulation (Factor 7, **Figure S18**) and cell cycle (Factor 6, **Figure S19**).

In conclusion, the MOFA+ output suggests that independent cell fate commitment events undergo different modes of epigenetic variation. While some lineages manifest global changes in the epigenetic landscape (ExE endoderm, Factor 1), other cell types are associated with the emergence of local epigenetic patterns that are driven by specific genomic contexts (embryonic endoderm and mesoderm, Factors 2 and 4).

## Discussion

As single-cell technologies mature, they are applied to generate data sets of increasing complexity, with highly structured and sparse measurements^{16,17,24,48,49}. Consequently, there is a need for integrative computational frameworks that can robustly and systematically interrogate the data generated in order to reveal the underlying sources of variation^{25}.

In this study we introduced MOFA+, a generalisation of the MOFA framework^{32} that facilitates analysis of large-scale datasets with complex multi-group and/or multi-view experimental designs. From a technical perspective, MOFA+ provides two major features: first, GPU-accelerated stochastic variational inference ensures scalability to potentially millions of cells; second, structured sparsity priors provide a principled inference framework to jointly analyse multiple data sets. Additionally, MOFA+ inherits all the features from its predecessor, including a natural approach for handling missing values as well as the capacity to perform inference with non-Gaussian readouts^{32}.

Although MOFA+ represents an important step forward in the analysis of single-cell omics data, it has some limitations. First, it requires multi-modal measurements from the same set of cells. This contrasts with other integrative frameworks such as Seurat^{31} or LIGER^{30}, which anchor data sets based on the assumption of a common feature space (e.g. matching gene expression and promoter accessibility). Second, the model is not currently able to capture strong non-linear relationships (**Figure S20**). We speculate that this could be addressed by combining MOFA with concepts from variational autoencoders, as recently proposed for the analysis of scRNA-seq data^{50–52}. Third, the model assumes independence between features in its prior distributions, despite the fact that genomic features are known to interact via complex regulatory networks^{53}.

To facilitate adoption of the method, we deploy MOFA+ as open-source software with multiple tutorials and a web-based analysis workbench. Together these support a large variety of downstream analyses and enable a user-friendly, in-depth characterisation of structured single-cell data.

## Methods

### Multi-Omics Factor Analysis v2 model (MOFA+)

The input to MOFA+ is a list of matrices, each matrix containing a predefined group of cells (group) and a predefined set of features (view, see Figure 1 for a visual representation).

We introduce the following notation: M for the number of views, D_{m} for the number of features in the *m*-th view, G for the number of groups, N_{g} for the number of samples in the *g*-th group and K for the number of factors.

As in the original version of MOFA^{32}, the underlying master equation is the standard matrix factorisation framework:

$$Y_{gm} = Z_{g} W_{m}^{T} + \varepsilon_{gm}$$

where:

- *Y_{gm}* denotes the matrix of observations for the *m*-th view and the *g*-th group.
- *W_{m}* denotes the weight matrix for the *m*-th view.
- *Z_{g}* denotes the factor matrix for the *g*-th group.
- *ε_{gm}* denotes the residual noise for the *m*-th view and the *g*-th group. The specific form of the noise can be tailored to the nature of the data type^{32}.

The factor matrix *Z_{g}* has dimensionality (N_{g}, K) and contains the low-dimensional representation of the samples from the *g*-th group. The weight matrix *W_{m}* has dimensionality (D_{m}, K) and contains an association score for each feature with each factor. The noise matrix *ε_{gm}* contains the unexplained variance (i.e. noise) for each feature in each group.

The model is formulated in a probabilistic Bayesian setting. We introduce prior distributions on all unobserved variables of the model in order to induce specific regularisation criteria, as described below.
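To make the dimensions concrete, the factorisation can be simulated in a few lines. This is a minimal sketch with Gaussian noise only; the group names, view names, matrix sizes and noise level are hypothetical and are not taken from any of the analyses in this study.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 3                        # number of factors
N = {"g1": 50, "g2": 40}     # hypothetical groups -> number of cells N_g
D = {"rna": 30, "met": 20}   # hypothetical views -> number of features D_m

# One factor matrix per group (N_g x K) and one weight matrix per view (D_m x K).
Z = {g: rng.normal(size=(n, K)) for g, n in N.items()}
W = {m: rng.normal(size=(d, K)) for m, d in D.items()}

# Y_gm = Z_g W_m^T + eps_gm, with Gaussian noise of standard deviation 0.1
# (the noise level is an arbitrary choice for illustration).
Y = {
    (g, m): Z[g] @ W[m].T + rng.normal(scale=0.1, size=(N[g], D[m]))
    for g in N
    for m in D
}

shapes = {key: y.shape for key, y in Y.items()}
```

Each group/view pair yields its own observation matrix of shape (N_{g}, D_{m}), while factors are shared across views and weights are shared across groups.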

### Interpretation of the factors

The MOFA+ factors capture global sources of variability in the data. The factor matrices express how active each factor is within the various groups of cells. Mathematically, each factor orders cells along a one-dimensional axis centred at zero. Cells with opposite signs have opposite effects along the inferred axis of variation. Cells with values close to zero represent either an intermediate phenotype or no phenotype at all associated with the factor under consideration.

### Interpretation of the weights

The weight matrices provide a score for how strongly each feature relates to each factor, hence allowing a biological interpretation of the MOFA+ factors. Features with no association with the factor have values close to zero, while features with a strong association with the factor have large absolute values. The sign of the weight indicates the direction of the effect: a positive weight indicates that the feature has higher levels in the cells with positive factor values, and vice versa.

### Model regularisation

The regularisation of the weights and the factors is critically important for enabling MOFA to perform inference with structured data sets.

In the original version of MOFA, structured priors were applied to the weights to enable inference and interpretable outputs of multi-view data sets (i.e. structured features but not samples were facilitated). In MOFA+ we generalised this by introducing a symmetric regularisation for both the factors and weights, hence accounting for structure in both the sample and the feature space (see appendix for mathematical details). The main purpose is to enable the identification of factors that are active in different subsets of both groups and views.

The first level of sparsity uses an Automatic Relevance Determination prior to explicitly model differential activity of factors across views and/or across groups. The second level of sparsity uses a spike-and-slab prior to simultaneously push individual weights and factors to zero. This effectively encourages sparse solutions where factors are (potentially) associated with a small number of active features and/or active within small subsets of samples. Using feature-wise sparsity priors helps to disentangle technical and biological sources of variability^{54}.
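The effect of the two sparsity levels can be illustrated by sampling weights for a single view. This is a minimal sketch, not the MOFA+ implementation: the precisions `alpha` and inclusion probabilities `theta` are fixed by hand here, whereas in a Bayesian treatment they would themselves be given priors and inferred (see the appendix for the actual model).

```python
import numpy as np

rng = np.random.default_rng(1)
D, K = 1000, 4   # features in this view, number of factors

# ARD level: one precision per factor in this view. A very large precision
# shrinks every weight of that factor towards zero, switching the factor off.
alpha = np.array([1.0, 1.0, 1e6, 1.0])   # factor with index 2 is inactive here

# Spike-and-slab level: each individual weight is non-zero with probability
# theta[k], so a small theta gives a factor with few active features.
theta = np.array([0.9, 0.1, 0.5, 0.5])

slab = rng.normal(scale=1.0 / np.sqrt(alpha), size=(D, K))  # continuous part
spike = rng.random(size=(D, K)) < theta                      # on/off indicator
W = slab * spike

frac_nonzero = (W != 0).mean(axis=0)   # per-factor fraction of active features
max_abs = np.abs(W).max(axis=0)        # per-factor largest absolute weight
```

The same construction applied to the factor matrices, with groups in place of views, gives the symmetric regularisation described above.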

### Noise model

MOFA+ supports a variety of different likelihood models to enable integration of diverse combinations of data types. These include a Gaussian noise model for continuous data, a Poisson model for count data and a Bernoulli model for binary data. This feature is inherited from MOFA^{32}.

### Stochastic variational inference

In MOFA, inference was performed using mean-field variational Bayes^{55–57}. While this framework is typically faster than sampling-based Monte Carlo approaches, it becomes prohibitively slow when applied to large single-cell datasets. In MOFA+ we implemented a stochastic version of the algorithm^{57,58} that can be accelerated by performing computation using GPUs.

The use of stochastic variational inference comes at the cost of introducing additional hyperparameters: batch size (number of cells used to compute the gradients), learning rate (step size) and forgetting rates (rate of decay of the learning rate). While we find the hyperparameters to be robust across a variety of simulated data (**Figure S1**), their optimisation is likely to be important in some contexts. By default we use GPU-powered standard variational inference if the full data set fits in the GPU. Otherwise, we perform stochastic variational inference using a batch size of 50%, a learning rate of 0.5 and a forgetting rate of 0.25.
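The roles of the three hyperparameters can be sketched as follows. The decay schedule below is a commonly used Robbins-Monro style form and is an assumption, as is the mapping of the forgetting rate onto `kappa`; the exact schedule used by MOFA+ is given in the appendix. The function names are hypothetical.

```python
import numpy as np

def learning_rate(t, rho0=0.5, kappa=0.25):
    """Step size at iteration t (0, 1, 2, ...), decaying from rho0.

    The functional form and the exponent 0.75 are assumptions made for
    illustration, not the documented MOFA+ schedule.
    """
    return rho0 / (1.0 + kappa * t) ** 0.75

def minibatches(n_cells, batch_fraction=0.5, rng=None):
    """Yield index arrays covering all cells once, in random order."""
    rng = np.random.default_rng() if rng is None else rng
    batch_size = max(1, int(round(batch_fraction * n_cells)))
    order = rng.permutation(n_cells)
    for start in range(0, n_cells, batch_size):
        yield order[start:start + batch_size]

def svi_step(param, batch_estimate, rho):
    """Generic stochastic update: blend the current value of a variational
    parameter with the estimate computed from the current minibatch."""
    return (1.0 - rho) * param + rho * batch_estimate
```

With the default values quoted above, the first update weights the old and new estimates equally, and later updates rely increasingly on the accumulated estimate.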

### Variance decomposition

Once the model is trained, we can quantify how much of the observed variance is explained by each factor *k* in each group *g* and in each view *m.* This is estimated using a coefficient of determination:
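One common form of this coefficient, stated here as an assumption rather than as the exact expression used in MOFA+, and assuming that each feature has been centred, is

$$R^{2}_{gmk} = 1 - \frac{\sum_{n,d} \left( y^{gm}_{nd} - z^{g}_{nk} w^{m}_{dk} \right)^{2}}{\sum_{n,d} \left( y^{gm}_{nd} \right)^{2}}$$

where the sums run over the observed entries of *Y_{gm}*. A minimal sketch of this computation for one group/view pair (the function name is hypothetical):

```python
import numpy as np

def variance_explained(Y, Z, W):
    """Per-factor R^2 for one group/view pair.

    Y: (N_g, D_m) centred observations, NaN for missing values.
    Z: (N_g, K) factor matrix; W: (D_m, K) weight matrix.
    """
    mask = ~np.isnan(Y)                  # use observed entries only
    total = np.sum(Y[mask] ** 2)
    r2 = np.empty(Z.shape[1])
    for k in range(Z.shape[1]):
        pred = np.outer(Z[:, k], W[:, k])   # rank-one reconstruction by factor k
        resid = np.sum((Y[mask] - pred[mask]) ** 2)
        r2[k] = 1.0 - resid / total
    return r2
```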

### Gene set enrichment analysis

Gene set enrichment analysis was performed using the Reactome gene sets^{59}. For every gene set G, we evaluate its significance via a parametric t-test, where we contrast the weights of the foreground set (features that belong to the set G) versus the background set (features that do not belong to the set G). Resulting P-values were adjusted for multiple testing for each factor using the Benjamini–Hochberg procedure^{60}. Enrichments were considered significant at a false discovery rate of 1%.
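The multiple-testing step can be sketched in plain Python. This is a generic implementation of the Benjamini–Hochberg adjustment, not the code used for the analyses; it takes the per-gene-set P-values of one factor as input, however they were obtained.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value to the smallest so that the adjusted
    # values never increase as the rank decreases.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

def significant(pvalues, fdr=0.01):
    """Indices of tests whose adjusted P-value is at or below the FDR threshold."""
    return [i for i, q in enumerate(benjamini_hochberg(pvalues)) if q <= fdr]
```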

### 10x: data processing

The gastrulation scRNA-seq atlas was obtained from^{33} and is available in the Gene Expression Omnibus under accession GSE87038. Cells were subset to stages E6.5, E7.0 and E7.25. Cells from stage E6.75 were not included in the analysis because they consist of a single biological replicate. Gene expression counts were normalised using *scran*^{61}. The 5,000 most overdispersed genes after regressing out the stage effect were selected prior to fitting the model. Details on the quality control and data preprocessing can be found in^{33}.

### Ecker: data processing

The mouse brain DNA methylation data was obtained from^{7} and is available in the Gene Expression Omnibus under accession GSE97179. DNA methylation was quantified over genomic features using a binomial model where the number of successes is the number of reads that support methylation (or accessibility) and the number of trials is the total number of reads. A CpG methylation rate was calculated for each genomic feature and cell using a maximum likelihood approach. The rates were subsequently transformed to M-values^{62} and modelled with a Gaussian likelihood.
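The two transformation steps can be sketched for a single genomic feature in a single cell. The maximum-likelihood estimate of a binomial rate is the observed proportion; the M-value is a log2 ratio of methylated to unmethylated signal. The offset of 0.01, which keeps the logarithm finite at rates of 0 or 1, is an assumption, and the function names are hypothetical.

```python
import math

def methylation_rate(met_reads, total_reads):
    """Maximum-likelihood estimate of the binomial success probability.

    Returns None when the feature has no coverage in this cell, so that
    the value can be treated as missing downstream.
    """
    if total_reads == 0:
        return None
    return met_reads / total_reads

def m_value(rate, offset=0.01):
    """log2 ratio of methylated to unmethylated fraction (offset is assumed)."""
    if rate is None:
        return None
    return math.log2((rate + offset) / (1.0 - rate + offset))
```

For example, a rate of 0.5 maps to an M-value of 0, and rates above 0.5 map to positive values.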

As input to MOFA+ we filtered out genomic features with low coverage (retaining features with at least 3 CpG measurements or at least 10 CH measurements) and we selected the intersection of the top 5,000 most variable sites across the different genomic and sequence contexts (see **Figure S8**). Details on the quality control and data preprocessing can be found in^{7}.

### scNMT: data processing

The gastrulation scNMT-seq multi-omics data was obtained from^{41} and is available in the Gene Expression Omnibus under accession GSE121708.

Gene expression counts were quantified over protein-coding genes using featureCounts^{63} with the Ensembl gene annotation 87^{64}. The read counts were log-transformed and size-factor adjusted, and modelled with a Gaussian likelihood. As input to MOFA+, we removed genes with a dropout rate higher than 90% and we subsetted the top 5,000 most variable genes (after regressing out the stage effect). In addition, batch effects and the dropout rate per cell were regressed out prior to fitting the model.
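The gene filtering can be sketched as follows. This is a simplified illustration: it ranks genes by raw variance and omits the regression of stage, batch and per-cell dropout effects described above. The function name and argument names are hypothetical.

```python
import numpy as np

def select_genes(expr, max_dropout=0.9, n_top=5000):
    """Return sorted column indices of the selected genes.

    expr: (cells x genes) matrix of normalised expression, where a value
    of zero counts as a dropout.
    """
    dropout = (expr == 0).mean(axis=0)              # fraction of zero cells per gene
    keep = np.flatnonzero(dropout <= max_dropout)   # drop genes above the threshold
    variances = expr[:, keep].var(axis=0)
    top = keep[np.argsort(variances)[::-1][:n_top]] # most variable genes first
    return np.sort(top)
```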

DNA methylation and chromatin accessibility data were quantified over genomic features using a binomial model where the number of successes is the number of reads that support methylation (or accessibility) and the number of trials is the total number of reads. A CpG methylation or GpC accessibility rate for each genomic feature and cell was calculated by maximum likelihood. The rates were subsequently transformed to M-values^{62} and modelled with a Gaussian likelihood. As input to MOFA+ we filtered out genomic features with low coverage (retaining features with at least 3 CpG and 5 GpC measurements) and we selected the top 2,500 most variable sites per combination of genomic context and data modality (see **Figure S14**). Details on the quality control and data preprocessing can be found in^{41}.

### Software availability

An open-source implementation of MOFA+ is available from https://github.com/bioFAM/MOFA2, which includes vignettes to reproduce the analyses presented in this article.

Also, we deploy an interactive web-based platform to facilitate the exploration of MOFA+ models (**Figure S21**). This is available from https://github.com/gtca/mofaplus-shiny

### Competing interests

All authors declare no competing financial interests.

## Author contributions

R.A., D.A. and B.V. conceived the project.

R.A., D.A., D.B., Y.D. and B.V. implemented the model.

R.A., D.A., J.C.M. and O.S. interpreted results.

D.A. implemented the interactive web-based platform.

R.A. generated figures.

R.A. wrote the manuscript with feedback from all authors.

J.C.M. and O.S. supervised the project.

## Acknowledgements

R.A. is a member of Robinson College at the University of Cambridge. We thank Florian Buettner for comments on the manuscript.