Title: Metric learning enables synthesis of heterogeneous single-cell modalities

A complete understanding of biological processes requires synthesizing information across heterogeneous modalities, such as age, disease status, or gene/protein expression. Until recently, single-cell profiling experiments could measure only a single modality, leading to analysis focused on integrating information across separate experiments. However, researchers can now measure multiple modalities simultaneously in a single experiment, providing a new data paradigm that enables biological discovery but also requires new conceptual and analytic models. We therefore present Schema, an algorithm that leverages a principled metric learning strategy to synthesize multimodal information from the same experiment. To demonstrate the flexibility and power of our approach, we use Schema to infer cell types by integrating gene expression and chromatin accessibility data, perform differential gene expression analysis while accounting for batch effects and developmental age, estimate evolutionary pressure on peptide sequences, and synthesize spliced and unspliced mRNA data to infer cell differentiation. Schema can synthesize arbitrarily many modalities and capture sophisticated relationships between them, is computationally efficient, and provides a valuable conceptual model for exploring and understanding complex biology.

Simultaneous multimodal experiments present a new analytic challenge of synthesizing agreement and disagreement across modalities. For example, how should one interpret the data if two cells look similar transcriptionally but are different epigenetically? While some multimodal analysis accommodates only specific modalities (e.g., special tools for spatial transcriptomics 18,19,60 ), is aimed at just gene-set estimation 20,21 , or is limited only to a pair of modalities 20 , a general multimodal analysis paradigm that applies and extends to any data modality holds the promise of unifying these observations to inform biological discovery. Importantly, given the rapid biotechnological progress that continues to enable novel measurement modalities and easier simultaneous multimodal profiling, such a paradigm should scale to massive single-cell datasets and be robust to noise and sparsity in the data. Furthermore, the ability to synthesize more than just two modalities provides deeper insights and more robust (accurate) inferences, as we demonstrate.
We therefore present Schema, a method that synthesizes multimodal data based on a conceptual framework that accommodates any number of arbitrary modalities. Schema draws from metric learning [22][23][24][25] , the subfield of machine learning concerned with computing an accurate measure of similarity (equivalently, distance) on a dataset. Our critical insight is to interpret each modality as describing a measure of distance among the underlying cells; we can then newly formulate the synthesis problem as reconciling the information implied by these different distance measures. For a modality chosen as the reference, Schema computes a transformation of its data that achieves a desired level of agreement with data from other modalities (Figure 1). For example, with RNA-seq data as the reference modality, Schema can transform the data so that it incorporates information from other modalities but remains amenable to standard RNA-seq analyses (e.g. cell-type inference, trajectory analysis, and visualization). A particularly useful property of Schema is that it provides weights that encode the level of agreement between features in the reference modality and the other modalities, enabling biological interpretation of cellular covariation among modalities.
In this study, we demonstrate how the generality and utility of Schema can go beyond existing analyses of diverse multimodal datasets and extract novel biological insights. We synthesize RNA-seq and ATAC-seq modalities from multimodal data on 11,296 mouse kidney cells to infer cell types, with Schema enabling an 11% increase in accuracy over previously described approaches. On a dataset of 16,152 embryonic mouse cells from six samples, we use Schema to perform differential expression analysis that accounts for confounding batch effects.
By synthesizing spliced and unspliced mRNA counts, we newly infuse RNA "velocity" information into t-SNE embeddings, creating more insightful visualizations of cell differentiation. Schema is designed to support the continually-expanding breadth of single-cell technologies while retaining the power, tunability and interpretability required for effective exploratory analysis.

Multimodal synthesis as metric learning
Before the advent of multimodal single-cell experiments, computational analysis focused on variation within a single modality. A critical problem brought about by multimodal experiments is how to reason about information across modalities in a mutually consistent way. Our key intuition is that each modality gives us information about the biological similarity among cells in the dataset, which we can mathematically interpret as a modality-specific distance metric. For example, in RNA-seq data, cells are considered biologically similar if they have similar gene expression profiles; this similarity can be proxied by the Euclidean distance between normalized expression vectors, with shorter distances corresponding to greater similarity.
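This interpretation of a modality as a distance metric can be sketched in a few lines of code; the following is a minimal illustration on toy normalized-expression data (not Schema's implementation, and the data are made up):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def modality_distances(X):
    """Pairwise Euclidean distance matrix for one modality (cells x features)."""
    return squareform(pdist(np.asarray(X, dtype=float)))

# Toy data: 4 cells x 3 genes of normalized expression.
rng = np.random.default_rng(0)
X_rna = rng.random((4, 3))
D_rna = modality_distances(X_rna)
```

Each modality of the same cells yields its own such matrix, and the synthesis problem is then a matter of reconciling these matrices.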
To synthesize these distance metrics, we draw inspiration from metric learning (Supp Text 1). Given a reference modality, Schema transforms this modality such that its Euclidean distances agree with a set of supplementary distance metrics from the other modalities, while also limiting the distortion from the original reference modality. Analyses on the transformed data will thus incorporate information from all modalities (Figure 1).
In our approach, the researcher starts by designating one of the modalities as the primary (i.e., reference) modality, consisting of observations that are mapped to points in a multidimensional space. In the analyses presented here, we typically designate the most informative or high-confidence modality as the primary (i.e., reference), with RNA-seq being a frequent choice (Discussion). The coordinates of points in the primary modality are then transformed using information from secondary modalities. Importantly, the transformation is constrained to limit the distortion to the primary modality below a researcher-specified threshold. This acts as regularization, preventing Schema from overfitting to other modalities and ensuring that the high-confidence information contained in the primary modality is preserved. We found this constraint to be crucial to successful multimodal syntheses. Without it, an unconstrained alignment of modalities using, for instance, canonical correlation analysis (a common approach in statistics for inferring information from cross-covariance matrices), is prone to overfitting to sample-specific noise, as we show in our results.
To see how Schema's transformation synthesizes modalities, consider the case where the primary dataset is gene expression data. While the points close in Euclidean space are likely to be biologically similar cells with shared expression profiles, longer Euclidean distances are less informative. Schema's constrained optimization framework is designed to preserve the information contained in short-range distances, while allowing secondary modalities to enhance the informativity of longer distances by incorporating, for example, cell-type metadata, differences in spatial density, or developmental relationships. To facilitate the representation of complex relationships between modalities, arbitrary distance metrics and kernels are supported for secondary modalities.
Schema's measure of inter-modality alignment is based on the Pearson correlation of distances, which it optimizes with a quadratic programming algorithm (further details in Methods). An important advantage of Schema's algorithm is that it returns coefficients that weight features in the primary dataset based on their agreement with the secondary modalities (for example, weighting the genes in a primary RNA-seq dataset that best agree with secondary developmental-age information). In this study, we demonstrate this interpretability in our applications of Schema.
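The core optimization can be sketched in a simplified form: scale each primary feature by a nonnegative weight so that the weighted squared distances correlate with the secondary metric, subject to a minimum correlation with the original primary distances. This sketch uses a generic constrained optimizer rather than Schema's quadratic-programming solver, and all data and parameter choices here are illustrative:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import pdist

def pairwise_sq_diffs(X):
    """Per-feature squared differences for every cell pair (pairs x features)."""
    i, j = np.triu_indices(X.shape[0], k=1)
    return (X[i] - X[j]) ** 2

def fit_weights(X_primary, d_secondary, min_self_corr=0.5):
    """Nonnegative feature weights w such that weighted squared distances
    (S @ w, which is linear in w) correlate with the secondary metric while
    keeping a minimum correlation with the unweighted primary distances.
    A simplified stand-in for Schema's constrained quadratic program."""
    S = pairwise_sq_diffs(X_primary)      # pairs x features
    d_primary = S.sum(axis=1)             # unweighted squared distances
    corr = lambda a, b: np.corrcoef(a, b)[0, 1]
    res = minimize(lambda w: -corr(S @ w, d_secondary),
                   np.ones(X_primary.shape[1]),
                   bounds=[(0.0, None)] * X_primary.shape[1],
                   constraints=[{"type": "ineq",
                                 "fun": lambda w: corr(S @ w, d_primary) - min_self_corr}],
                   method="SLSQP")
    return res.x

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
# A secondary metric driven mostly by feature 0 of the primary data.
d_sec = pdist(X[:, [0]]) ** 2 + 0.1 * rng.random(30 * 29 // 2)
w = fit_weights(X, d_sec)
```

In this toy setting the optimizer should up-weight feature 0, mirroring how Schema's weights flag the primary features that agree with a secondary modality.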

Inferring cell types by synthesizing gene expression and chromatin accessibility
We first sought to demonstrate the value of Schema by applying it to the increasingly common and broadly interesting setting in which researchers simultaneously profile the transcriptome and chromatin accessibility of single cells 6 . Focusing on cell type inference, a key analytic step in many single-cell studies, we applied Schema on a dataset of 11,296 mouse kidney cells with simultaneously assayed RNA-seq and ATAC-seq modalities and found that synthesizing the two modalities produces more accurate results than using either modality in isolation ( Figure 2F, Supp Figure 3).
With RNA-seq as the primary (i.e., reference) dataset and ATAC-seq as the secondary, we applied Schema to compute a transformed dataset in which pairwise RNA-seq distances among cells are better aligned with distances in the ATAC-seq peak counts data while retaining a very high correlation with primary RNA-seq distances (≥ 99%, Methods). We then clustered the cells by performing Leiden community detection 50 (a recently introduced modification of the Louvain algorithm 51 ) on the transformed dataset and compared these clustering assignments to the Leiden clusters obtained without Schema transformation. We used the expert-defined labels from Cao et al. 6 as the ground truth cluster labels and quantified clustering agreement with the adjusted Rand index (ARI), which takes a higher value when two sets of labels agree more closely. Leiden clustering on Schema-transformed data agrees better with the ground truth annotations of cell types (ARI of 0.46) than the corresponding Leiden cluster labels using just the RNA-seq or ATAC-seq datasets individually (ARIs of 0.40 and 0.04, respectively, Figure 2F).
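The ARI comparison underlying this evaluation is a standard scikit-learn computation; a toy illustration (the labels here are made up, not the kidney data):

```python
from sklearn.metrics import adjusted_rand_score

# Toy ground-truth cell types and two hypothetical clusterings.
truth     = [0, 0, 0, 1, 1, 1, 2, 2, 2]
rna_only  = [0, 0, 1, 1, 1, 1, 2, 2, 2]   # one misassigned cell
atac_only = [0, 1, 2, 0, 1, 2, 0, 1, 2]   # near-random labels

ari_rna  = adjusted_rand_score(truth, rna_only)
ari_atac = adjusted_rand_score(truth, atac_only)
```

ARI adjusts for chance agreement, so random labelings score near zero (or below) while perfect agreement scores 1.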
Here, Schema facilitated a biologically informative synthesis despite limitations of data quality or sparsity in the ATAC-seq secondary modality. We observed that using only ATAC-seq data to identify cell types leads to poor concordance with ground-truth labels (Supp Figure 3), likely because of the sparsity of this modality (for example, only 0.28% of the peaks were reported to have non-zero counts, on average); this sparsity was also noted by the original study authors. We note that an unconstrained synthesis of the modalities using canonical correlation analysis (CCA) resulted in an ARI of 0.31, lower than what is achieved by using just the RNA-seq modality ( Figure 2F). However, since Schema constrains the ATAC-seq modality's influence when synthesizing it with RNA-seq data, we could extract additional signal provided by ATAC-seq while preserving the rich information provided by the transcriptomic modality. We also evaluated a heuristic approach described in the original study: group cells into small clusters ("pseudocells") by RNA-seq similarity and compute an average ATAC-seq profile per pseudocell, using these profiles for the final clustering (Methods). This approach also underperformed Schema (ARI of 0.20).
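The pseudocell heuristic can be sketched roughly as follows: cluster cells by RNA similarity, then average the sparse ATAC profile within each cluster. This is a simplified stand-in with made-up data and cluster counts, not the original study's exact procedure:

```python
import numpy as np
from sklearn.cluster import KMeans

def pseudocell_profiles(X_rna, X_atac, n_pseudocells=5, seed=0):
    """Group cells into small RNA-based clusters ("pseudocells") and average
    the sparse ATAC-seq profile within each group."""
    labels = KMeans(n_clusters=n_pseudocells, n_init=10,
                    random_state=seed).fit_predict(X_rna)
    profiles = np.vstack([X_atac[labels == k].mean(axis=0)
                          for k in range(n_pseudocells)])
    return labels, profiles

rng = np.random.default_rng(2)
X_rna = rng.random((50, 8))
X_atac = (rng.random((50, 20)) < 0.05).astype(float)   # sparse peak counts
labels, profiles = pseudocell_profiles(X_rna, X_atac)
```

Averaging mitigates ATAC-seq sparsity but discards cell-level variation, which may explain why this heuristic underperformed the constrained synthesis.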
To further analyze why combining modalities improves cell type clustering, we obtained Leiden cluster labels using either the RNA-seq or the ATAC-seq modalities individually. We then evaluated these cluster assignments by iterating over subsets of the data, each covering only a pair of ground-truth cell types, and used the ARI score to quantify how well the cluster labels distinguished between the two cell types. While RNA-seq clusters have higher ARI scores overall, indicating a greater ability to differentiate cell types, ATAC-seq does display a relative strength in distinguishing proximal tubular (PT) cells from other cell types (Figure 2A). PT cells are the most numerous cells in the kidney dataset and many of the misclassifications in the RNA-seq-based clustering relate to these cells (Figure 2B-D). When the two modalities are synthesized with Schema, a significant number of these PT cells are correctly assigned to their ground truth cell types (one-sided binomial test, p = 6.7 x 10^-15), leading to an overall improvement in clustering quality (Figure 2E).

Differential expression analysis while accounting for batch effects and developmental age
Aside from cell type inference, another important single-cell analysis task that stands to benefit from multimodal synthesis is the identification of differentially expressed marker genes. To illustrate how, we explored a mouse gastrulation single-cell dataset 57 , consisting of 16,152 epiblast cells split over three developmental timepoints (E6.5, E7.0, and E7.25) and with two replicates at each timepoint, resulting in six distinct batches ( Figure 3A). Applying Schema to this dataset, we sought to identify differentially expressed genes that are consistent with the developmental time course while being robust to batch effects between the replicate pairs. To perform differential expression analysis with Schema, RNA-seq data should be used as the primary modality, while the distance metrics of the secondary modalities specify how cells should be differentiated from each other. Here, we used batch and developmental-age information as secondary modalities, configuring Schema to maximize RNA-seq data's agreement with developmental age and minimize its agreement with batch information. We weighted these co-objectives equally; results were robust to ± 25% variations in these weights (Supp Figure 9). We used RNA-seq data as the primary dataset, representing it by its top ten principal components (Methods).
We evaluated Schema alongside MOFA+, a recently introduced single-cell multimodal analysis technique 21,53 . In addition to accounting for batch effects, we could also configure Schema to reduce the weight of transient changes in expression, thus identifying genes with monotonically changing expression along the time course (Figure 3B-D). To do so, we encoded developmental age as a distance metric by specifying zero distance between cells at the same timepoint, unit distance between directly adjacent timepoints, and an additive sum of the unit distances across more separated timepoints. As a control, we also tested a metric that distinguished between the stages but was not monotonic in time, finding that the highest-weighted feature (PC5) in that case was indeed non-monotonic (Methods, Figure 3B-C). To encode batch effects as a distance metric, we specified zero distance between cells in the same replicate and unit distance otherwise.
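The two metadata encodings described above are simple to state in code; a minimal sketch (the stage names follow the dataset, the function names are ours):

```python
STAGES = ("E6.5", "E7.0", "E7.25")

def age_distance(t_i, t_j, stages=STAGES):
    """Zero within a timepoint, unit distance between adjacent timepoints,
    and an additive sum of unit distances across more separated timepoints."""
    return abs(stages.index(t_i) - stages.index(t_j))

def batch_distance(b_i, b_j):
    """Zero for cells in the same replicate, one otherwise."""
    return 0 if b_i == b_j else 1
```

Under this encoding, for example, cells at E6.5 and E7.25 are two unit steps apart, so monotonic expression changes across the full time course are rewarded more than transient ones.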
We estimated the set of differentially expressed genes as the top-loading genes of the principal components up-weighted by Schema. Seeking to evaluate if the Schema or MOFA+ genes did show time-dependent monotonicity in expression, we linearly regressed each identified gene's normalized expression against an ordering of the three developmental stages (Methods).
We found that the Schema genes corresponded to regression coefficients significantly different from zero (Figure 3D-E), consistent with time-dependent monotonicity (two-sided t-test, p = 3.83 x 10^-6); this was not true of MOFA+ (p = 0.77).
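The monotonicity check can be illustrated on synthetic expression values, with the stage ordering encoded as 0, 1, 2 (the data here are made up; the paper's exact normalization may differ):

```python
import numpy as np
from scipy.stats import linregress

# Toy expression of two genes across cells at three ordered stages.
stage = np.repeat([0, 1, 2], 20)
rng = np.random.default_rng(3)
expr_monotonic = 0.5 * stage + rng.normal(scale=0.2, size=60)
expr_transient = np.where(stage == 1, 1.0, 0.0) + rng.normal(scale=0.2, size=60)

# Regress expression against the stage ordering; a slope significantly
# different from zero is consistent with time-dependent monotonicity.
res_mono = linregress(stage, expr_monotonic)
res_tran = linregress(stage, expr_transient)
```

The transient gene peaks at the middle stage, so its fitted slope is near zero even though its expression varies strongly, which is exactly the behavior the control metric was designed to expose.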
Next, we evaluated the batch-effect robustness of Schema and MOFA+ gene sets. Our configuration of Schema balances batch-effect considerations against differential expression considerations. For instance, introducing the batch-effect objective in Schema reduces the weights associated with the first and second principal components (PC1 and PC2), which show substantial within-timepoint batch-effect variations without a compensating time-dependent monotonicity, by 11% and 17%, respectively. In comparison, explicitly up-weighting "good" variation or down-weighting "bad" variation is difficult when using MOFA+. To systematically evaluate the batch-effect robustness of Schema and MOFA+ gene sets, we constructed benchmark sets of differentially expressed genes by applying a standard statistical test, adjusting for batch effects by exploiting the combinatorial structure of this dataset. Specifically, we aggregated over computations that each considered only one replicate per timepoint (Methods).
We then measured the overlap of Schema and MOFA+ gene sets with these benchmarks (Figure 3F) and found that, compared to MOFA+, the Schema gene set shows a markedly higher overlap with the benchmarks that is statistically significant (hypergeometric test with Bonferroni correction, p = 5.9 x 10^-12 for the benchmark set of size 188). Schema allows us to express the intuition that variation attributable to batch effects should be ignored while variation attributable to developmental age should be highlighted.
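The overlap significance computation is a standard hypergeometric tail probability; a sketch with hypothetical counts (only the benchmark size of 188 comes from the text, the other numbers are invented for illustration):

```python
from scipy.stats import hypergeom

def overlap_pvalue(n_universe, n_benchmark, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random
    from a universe containing n_benchmark benchmark genes."""
    return hypergeom.sf(n_overlap - 1, n_universe, n_benchmark, n_selected)

# Hypothetical: 10,000 genes total, benchmark of 188, 100 selected genes,
# 30 of which fall in the benchmark (expected overlap by chance: ~1.9).
p = overlap_pvalue(10_000, 188, 100, 30)
```

A Bonferroni correction, as used in the text, would then multiply this p-value by the number of benchmark sets tested.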

Spatial density-informed differential expression among cerebellar granule cells
We next performed a preliminary analysis with Schema of spatial transcriptomics data, another increasingly important multimodal scenario, here encompassing gene expression, cell-type labels, and spatial location. We obtained Slide-seq data containing 62,468 transcriptomes that are spatially located in the mouse cerebellum. In the original study, these transcriptomes were assigned to putative cell types (noting that these transcriptomes are not guaranteed to be single-cell), and thus cell-type assignments are located throughout the tissue 10,26 . Interestingly, we observed spatial density variation for certain cell types; specifically, transcriptomes corresponding to granule cell types are observed in regions of both high and low spatial density (Supp Figure 1B in this paper; also Figure 2B of Rodriques et al. 10 ). We therefore reasoned that we could use Schema to identify genes that are differentially expressed in granule cells in high density areas versus granule cells in low density areas. Schema is well suited to the constrained optimization setting of this problem: we want to optimize for genes expressed specifically in granule cells and in dense regions, but not all granule cells are in dense regions and not all cells in dense regions are granule cells. We specified RNA-seq data as the primary modality and spatial location and cell-type labels as the secondary modalities, with spatial density controlled by a distance metric that scores two cells as similar if their spatial neighborhoods have matching densities (Methods).
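The density-matching metric can be sketched with a k-nearest-neighbor density estimate; the choice of k and the inverse-mean-distance estimator are ours for illustration, not necessarily those used in the study:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_density(coords, k=5):
    """Inverse mean distance to the k nearest spatial neighbors."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(coords)
    dists, _ = nn.kneighbors(coords)
    return 1.0 / dists[:, 1:].mean(axis=1)   # drop the self-distance column

def density_distance(dens_i, dens_j):
    """Two cells score as similar if their neighborhood densities match."""
    return abs(dens_i - dens_j)

# Toy tissue: a tightly packed region and a spread-out region.
rng = np.random.default_rng(4)
dense  = rng.normal(0, 0.1, size=(30, 2))
sparse = rng.normal(5, 1.0, size=(30, 2))
dens = local_density(np.vstack([dense, sparse]))
```

Under this metric, two granule cells in equally dense neighborhoods are "close" even if they are far apart in physical space, which is the behavior the analysis requires.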

Schema outperforms alternative methods for spatial transcriptomic analysis
We sought to benchmark our method by comparing the robustness of Schema's results with those based on canonical correlation analysis (CCA) and with two methods specifically intended for spatial transcriptomics, namely SpatialDE 18 and Trendsceek 19 .
An important point is that CCA, SpatialDE, and Trendsceek are less general than Schema and therefore require non-trivial modifications to approximately match Schema's capabilities.
CCA is limited in that it can correlate only two datasets at a time, whereas here we seek to synthesize three modalities: gene expression, cell-type labels, and spatial density. We adapted CCA by correlating two modalities at a time and combining the sub-results (Methods). In the case of SpatialDE and Trendsceek, their unsupervised formulation does not allow the researcher to specify the spatial features to pick out (we focus on spatial density variation). To adapt these, we collated their results from separate runs on granule and non-granule cells (Methods).
Notably, the ad hoc modifications required to extend existing methods beyond two modalities underscore the benefit of Schema's general analytic formulation that can be naturally extended to incorporate any number of additional data modalities.
To evaluate the stability and quality of spatial transcriptomic analysis across different techniques, we analyzed three replicate samples of mouse cerebellum tissue (coronal sections prepared on the same day 10 ; pucks 180430_1, 180430_5, 180430_6) and compared the results returned separately for each replicate (Methods). While both Schema and CCA identify a gene set that ostensibly corresponds to granule cells in dense regions (Supp Figure 1D; 35.7% and 59.5%, respectively; FDR q < 0.001 in all cases), we find that Schema's results are substantially more robust across the three replicates. We also find that Schema, in not seeking an unconstrained optimum, is more robust to overfitting to sample-specific noise than CCA (Methods; Supp Figure 1E).
When performing the same gene list robustness analysis with SpatialDE and Trendsceek, while also looking at the stability of their gene rankings specific to the precursor cell type (gray points in Supp Figure 1E), we found that SpatialDE produced slightly more stable gene rankings than Trendsceek, with median sample-pair correlations of 0.089 and -0.002, respectively, but these were still much lower than those for Schema. We also observed that SpatialDE and Trendsceek had substantially longer running times, so we performed our analysis of the two methods on subsets of the overall dataset (see "Schema can scale to massive single-cell datasets" for precise runtime and memory usage). These results demonstrate the robustness and efficiency of Schema's supervised approach.

Genomic location and accessibility inform variability in expression
We next sought to demonstrate the flexibility of Schema in analyses beyond cell type clustering and differential expression analysis. We turned to a study that simultaneously profiled gene expression and chromatin accessibility in 3,260 human A549 cells 6 . Using Schema, we characterized the genomic regions (relative to a gene's locus) where chromatin accessibility strongly correlates with the gene's expression variability, i.e., regions whose accessibility is differentially important for highly variable genes. Schema assigned the highest weight to features associated with chromatin accessibility over long ranges, i.e., ~10 megabases (Mb) (Supp Figure 2C). Searching for an explanation, we investigated whether highly variable genes share genomic neighborhoods and mapped gene loci to topologically associated domains 32 (TADs) of this cell type. We found strong statistical evidence that highly variable genes are indeed clustered together in TAD compartments (Supp Figure 2D), supporting our findings from Schema and suggesting an epigenetic role in controlling gene expression variability.
This demonstration illustrates how Schema's generality facilitates innovative explorations of multimodal data beyond, for example, cell type clustering or differential gene expression analysis. Here we chose genes to be the unit of observation, allowing us to design a primary dataset that links each gene's RNA-seq measurements to ATAC-seq peak counts in its neighborhood. Specifically, each primary feature corresponds to a genomic window relative to the gene's locus and is scored as the sum over all cells of peak counts in the window, each cell weighted by the gene's expression. We created multiple features, each corresponding to a specific window size and placement (Methods, Supp Figure 2B). Reusing RNA-seq as the secondary modality, we designed a distance metric that captures similarity in expression variability: for each gene, we normalize and sort the vector of its expression values across cells so that identical vectors imply an identical pattern of gene expression variation (Methods). We used Schema as a feature-selection mechanism, where up-weighted features correspond to genomic windows important for explaining gene expression variability.
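The sort-and-normalize construction behind the variability metric can be sketched directly; this is a minimal illustration and the study's exact normalization may differ:

```python
import numpy as np

def variability_profile(expr):
    """Normalize and sort a gene's expression vector across cells so that
    genes with the same pattern of variation map to identical vectors."""
    e = np.asarray(expr, dtype=float)
    e = (e - e.mean()) / (e.std() + 1e-12)   # z-score removes scale and offset
    return np.sort(e)

def variability_distance(expr_a, expr_b):
    """Distance capturing similarity in expression variability."""
    return np.linalg.norm(variability_profile(expr_a) - variability_profile(expr_b))

rng = np.random.default_rng(5)
base    = rng.random(100)
shifted = 3.0 * base + 7.0     # same variation pattern, different scale
noisy   = rng.random(100)      # an unrelated pattern
```

Because z-scoring removes affine differences and sorting removes cell identity, `base` and `shifted` map to the same profile while an unrelated gene does not.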
We then benchmarked this feature selection approach against a ridge regression 54 where features of the primary modality were specified as the explanatory variables and the standard deviation of each gene's expression (summarized from the secondary modality) was the response variable. Both analyses agreed in assigning the highest weights to features corresponding to long-range (~10 Mb) genomic regions upstream and downstream of a gene. However, Schema's regularization mechanism helps produce more consistent and stable feature weights, as evaluated on subsets of genes grouped by strand orientation or chromosome (Supp Figure 2C and Supp Table 3).
Our finding that chromatin accessibility at a long range (~10 Mb) may play a role in gene expression variability aligns with the observation by Kim et al. 33 that long-range DNA methylation predicts gene expression. To further investigate the connection between chromatin accessibility and gene expression variability, we analyzed gene membership in TADs, hypothesizing that gene expression variability is likely to be influenced by the organization of TADs in the genome. We analyzed the clustering of highly variable genes (HVGs) on TADs within A549 cells (inferred from Hi-C data, accession ENCFF336WPU in ENCODE 34,35 ) and found that HVGs are indeed more likely to be clustered together in TADs (Supp Figure 2D). By two independent permutation-based tests (Methods), we were able to reject the null hypothesis that genes in the top quartile of variability are dispersed randomly across TADs (p < 0.004 in both cases, with Bonferroni correction). Schema-based feature weighting therefore revealed an association between genomic architecture and gene variability. Notably, these results show how synthesis of multimodal RNA-and ATAC-seq data not only benefits standard analyses like cell type inference, but also enables creative and diverse exploratory data analysis.
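One flavor of such a permutation test can be sketched as follows: count same-TAD pairs among highly variable genes and compare against shuffled HVG labels. The toy genome and the choice of statistic here are illustrative, not the paper's exact tests:

```python
import numpy as np

def tad_clustering_stat(tad_of_gene, is_hvg):
    """Number of same-TAD pairs among highly variable genes."""
    hvg_tads = tad_of_gene[is_hvg]
    return sum(int(np.sum(hvg_tads == t)) * (int(np.sum(hvg_tads == t)) - 1) // 2
               for t in np.unique(hvg_tads))

def permutation_pvalue(tad_of_gene, is_hvg, n_perm=500, seed=0):
    """P(statistic >= observed) when HVG labels are shuffled across genes."""
    rng = np.random.default_rng(seed)
    obs = tad_clustering_stat(tad_of_gene, is_hvg)
    null = [tad_clustering_stat(tad_of_gene, rng.permutation(is_hvg))
            for _ in range(n_perm)]
    return (1 + sum(s >= obs for s in null)) / (1 + n_perm)

# Toy genome: 100 genes in 20 TADs of 5 genes each; the 10 HVGs are
# concentrated in the first two TADs.
tads = np.repeat(np.arange(20), 5)
hvg = np.zeros(100, dtype=bool)
hvg[:10] = True
p = permutation_pvalue(tads, hvg)
```

Rejecting the null here corresponds to rejecting random dispersal of highly variable genes across TADs, as in the text.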

Schema reveals CDR3 segments crucial to T-cell receptor binding specificity
To further demonstrate the generality of Schema, we applied it to synthesize data modalities beyond gene expression. We integrated single-cell multimodal proteomic and functional data with Schema to better understand how sequence diversity in the hypervariable CDR3 segments of T-cell receptors (TCRs) relates to antigen binding specificities 37 . De novo design of TCRs for an antigen of interest remains a pressing biological and therapeutic goal 40,41 , making it valuable to identify the key sequence locations and amino acids that govern the binding characteristics of a CDR3 segment. Towards this end, we analyzed a single-cell dataset that recorded clonotype data for 62,858 T-cells and their binding specificities against a panel of 44 ligands 5 , and used Schema's feature-selection capabilities to estimate the sequence locations and residues in the CDR3 segments of α and β chains important to binding specificity.
To estimate location-specific selection pressure, we ran Schema with the CDR3 peptide sequence data as the primary modality and the binding specificity information as the secondary modality, performing separate runs for α and β chains. In the primary modality, each feature corresponds to a CDR3 sequence location and we used the Hamming distance metric between observations (i.e., the number of locations at which two sequences differ, see Methods). Schema assigned relatively low feature weights to locations 3-9 (in the α chain CDR3) and 5-12 (in the β chain CDR3), suggesting those regions can tolerate greater sequence variability while preserving binding specificity.
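The Hamming metric on CDR3 sequences is simple to state; the padding behavior for unequal-length sequences below is our choice for illustration:

```python
from itertools import zip_longest

def cdr3_hamming(seq_a, seq_b, pad="-"):
    """Number of positions at which two CDR3 sequences differ; the shorter
    sequence is padded so a length difference also counts as mismatches."""
    return sum(a != b for a, b in zip_longest(seq_a, seq_b, fillvalue=pad))

# Two hypothetical CDR3 sequences differing by one substitution.
d = cdr3_hamming("CASSLGQAYEQYF", "CASSLGQSYEQYF")
```

With this metric, each primary feature (sequence location) contributes independently to the pairwise distance, which is what allows Schema's weights to be read as location-specific tolerance to variation.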
To evaluate these results, we compared them to estimates based on CDR3 sequence motifs sourced from VDJdb 38 , a curated database of TCRs with known antigen specificities. In VDJdb, TCR motifs are scored using an adaptation of the relative-entropy algorithm 39 . Next, we used Schema to investigate the selection pressure on amino acids present in the variability-prone locations identified above (Methods). We first selected a sequence location (e.g., location 4 in α chain CDR3) and constructed a primary modality where each cell was represented by a one-hot encoding of the amino acid at the location (i.e., a 20-dimensional boolean vector). The secondary modality was binding specificity information, as before. We performed separate Schema runs for each such location of interest on the two chains, computing the final score for each amino acid as the average score across these runs. These scores are in good agreement with the corresponding amino acid scores aggregated from the VDJdb database (Spearman rank correlation = 0.74, two-sided t-test p = 2 x 10^-4 ). The residue and location preferences estimated here can be used directly in any algorithm for computational design of epitope-specific CDR3 sequences to bias its search towards more functionally plausible candidate sequences.
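The one-hot residue encoding for a chosen CDR3 location can be sketched as follows (the alphabet ordering is ours for illustration):

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # 20 standard amino acids

def one_hot_residue(seq, position):
    """20-dimensional boolean encoding of the amino acid at one CDR3 location."""
    v = np.zeros(len(AMINO_ACIDS), dtype=bool)
    if position < len(seq):
        v[AMINO_ACIDS.index(seq[position])] = True
    return v

# Hypothetical CDR3 fragment; location 4 (0-based index 3) holds a serine.
enc = one_hot_residue("CASSLG", 3)
```

With this primary modality, Schema's per-feature weights become per-amino-acid scores at the chosen location.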
Schema's ability to efficiently synthesize arbitrarily many modalities, with their relative importance at the researcher's discretion, allows information that might otherwise be set aside (e.g., metadata like batch information, cell line, or donor information) to be effectively incorporated, enhancing the robustness and accuracy of an analysis. In Methods, we exemplify this use-case on the TCR dataset by incorporating measurements of cell-surface markers as an additional secondary modality, hypothesizing that cell-surface protein levels should be unrelated to V(D)J recombination variability.

Schema highlights secondary patterns while preserving primary structure
Another powerful use of Schema is to infuse information from other modalities into RNA-seq data while limiting the data's distortion so that it remains amenable to a range of standard RNA-seq analyses. Having demonstrated this capability for cell type inference, we now explore another use case. Since widely-used visualization methods such as UMAP 42 do not allow a researcher to specify aspects of the underlying data that they wish to highlight in the visualization, we sought to apply Schema to improve the informativity of single-cell visualizations. We leveraged Schema to highlight the age-related structure in an RNA-seq dataset of Drosophila melanogaster neurons 3 profiled across a full lifespan, while still preserving most of the original transcriptomic structure. We chose RNA-seq as the primary modality and temporal metadata as the secondary modality, configuring Schema to maximize the correlation between distances in the two while constraining the distortions induced by the transformation (Methods). We then visualized the transformed result in two dimensions with UMAP.
While some age-related structure does exist in the original data, Schema-based transformation of the data more clearly displays a cellular trajectory consistent with biological age (Figure 5). Importantly, accessing this structure required only a limited distortion of the data, corresponding to relatively high values (≥ 0.99) of the minimum correlation constraint (Figure 5C). To verify that Schema was able to infuse additional age-related structure into RNA-seq data, we performed a diffusion pseudotime analysis of the original and transformed datasets and found that the Spearman rank correlation between this pseudotime estimate and the ground-truth cell age increased from 0.365 in the original data to 0.405 and 0.436 in the transformations corresponding to minimum correlation constraints of 0.999 and 0.99, respectively. In contrast, an unconstrained synthesis by CCA leads to a lower correlation (0.059) than seen in the original RNA-seq dataset; the corresponding CCA-based UMAP visualization is also less clear in conveying the cellular trajectory (Supp Figure 10). Schema thus enables visualizations that synthesize biological metadata, while preserving much of the distance-related correlation structure of the original primary dataset. With Schema, researchers can therefore investigate single-cell datasets that exhibit strong latent structure (e.g., due to secondary metadata like age or spatial location), needing only a small transformation to make that structure visible.

Schema can synthesize spliced and unspliced RNA counts to accentuate cell differentiation
We next leveraged the flexibility of Schema to study cell differentiation by synthesizing spliced and unspliced mRNA counts in a dataset of 2,930 mouse dentate gyrus cells 57 . Specifying spliced counts as the primary dataset and unspliced counts as the secondary dataset, we configured Schema to compute a transformation of the spliced data that maximizes the correlation of its Euclidean distances with those in the unspliced dataset while distorting the former only minimally (Figure 6A-C).
Our intuition here is the same as that underlying RNA velocity techniques: since unspliced mRNA levels anticipate a cell's future transcriptional state, correlating the structure of the spliced data with that of the unspliced data accentuates the axes of ongoing differentiation.

Schema can scale to massive single-cell datasets
We have designed Schema to process large single-cell datasets efficiently, with modest memory requirements. On average, Schema processes data from a Slide-seq replicate (three modalities; 20,823 transcriptomes × 17,607 genes) in 34 minutes, requiring less than 5GB of RAM in the process (Supp Table 4). The runtime includes the entire set of Schema sub-runs performed over an ensemble of parameters, as well as the time taken for the pre-processing transformation.
Schema's efficiency stems from our novel mathematical formulation. Deviating from standard metric learning approaches, we formulate the synthesis problem as a quadratic-program optimization, which can be solved much faster than the semi-definite program formulations typically seen in these approaches (Supp Text 1). Additionally, while the full Schema algorithm has quadratic scalability in the number of cells, our formulation allows us to obtain good approximations with provably bounded error using only a logarithmic subsample of the dataset (Supp Text 3), enabling sublinear scalability in the number of cells that will be crucial as multimodal datasets increase in size.

Discussion
We designed Schema to be a powerful approach to multimodal data analysis. Schema is based on an elegant conceptual formulation in which each modality is defined by a distance metric. This formulation enables analysis of an arbitrary number of modalities and applies to any modality for which an appropriate distance metric can be defined.
Our approach is supervised, allowing the researcher to control which modality to transform, the amount of distortion, and the desired level of agreement between modalities.
While existing methods like Seurat v3 16 and LIGER 17 are designed for unsupervised discovery of common patterns across experiments, Schema's supervised formulation facilitates a broader set of investigations, enabling us to not only infer cell types and identify gene sets but also rank amino acids by selection pressure and estimate the relative importance of different genomic neighborhoods in epigenetic regulation.
When choosing a primary (i.e., reference) modality, we recommend selecting the highest-confidence modality or the one for which feature selection will be most informative. In many of our demonstrations, we chose RNA-seq as the primary modality since it is often the modality where preprocessing and normalization are best understood, boosting our confidence in it; additionally, transformed RNA-seq data lends itself to a variety of downstream analyses. Once a primary modality has been designated, Schema can synthesize an arbitrary number of secondary modalities with it. In contrast, methods designed around pairwise modality comparison need ad hoc adaptations to accommodate additional modalities. Schema's approach is advantageous not only for datasets with more than two modalities 5,44 but also in cases where metadata (e.g., batch information and cell age) can be productively incorporated as additional modalities.
Intuitively, our correlation-based alignment approach has parallels to kernel canonical correlation analysis (kernel CCA), a generalization of CCA where arbitrary distance metrics can be specified when correlating two datasets. While Schema offers similar flexibility for secondary modalities, it limits the primary modality to Euclidean distances. Introducing this restriction enhances scalability, interpretability and robustness. Unlike kernel CCA, the optimization in Schema operates on matrices whose size is independent of the dataset's size, enabling it to scale sub-linearly to massive single-cell datasets. Also, the optimal solution is a scaling transform that can be naturally interpreted as a feature-weight vector. Perhaps most importantly, Schema differs from kernel CCA in performing a constrained optimization, thus reducing the distortion of the primary dataset and ensuring that sparse and low-confidence secondary datasets do not drown out the primary signal.
The constrained optimization in Schema acts as regularization, helping ensure that the computed transformation and feature selection remain biologically meaningful. By choosing a high-confidence modality as the primary modality and bounding its distortion when incorporating the secondary modalities, Schema enables information synthesis while retaining high-confidence insights. This bound on the distortion is an important parameter, directly controlling how much the secondary modalities inform the primary dataset; values approaching 1 will increasingly limit the influence of the secondary modalities. Therefore, we recommend that studies using Schema for feature selection should aggregate the results over a range of values of this parameter while analyses that utilize only a single parameter should keep it high (≥ 0.9, the default setting in our implementation is 0.99) to preserve fidelity with the original dataset. We also strongly recommend that studies with a single parameter should report the value of this parameter alongside their results.
Interesting future methodological work could explore alternative formulations of the Schema objective, potentially including more complex nonlinearities than our quadratic-program formulation. Schema can also guide further biological experiments that profile only the highly weighted features based on other data modalities, enabling efficient, targeted follow-up analysis.
Given the current pace of biotechnological development, we anticipate that high-throughput experiments, and their conclusions, will increasingly rely on more than one data modality, underscoring the importance of Schema and its conceptual framework. Schema is publicly available for use at http://schema.csail.mit.edu and as the Python package schema_learn.

Correlation-based alignment and quadratic programming optimization
Underlying our definition of the alignment of metrics is the intuitive notion that two metrics are similar if their orderings of pairwise distances are close. A proxy for measuring this alignment is the Pearson correlation coefficient. For Schema, the goal is thus that pairwise distances in the transformed space are highly correlated with pairwise distances under each metric.
One of the advantages of the Pearson correlation coefficient is that it is amenable to optimization via quadratic programming (QP). QP is a generalization of linear programming that allows a quadratic objective function. We learn a scaling transformation w (Supp Text 1) on the primary dataset such that the pairwise distances of the transformed data w * x (where * denotes coordinate-wise multiplication, applied to each x in the primary dataset) are highly correlated with the pairwise distances in the secondary modalities. We codify our intuition of the importance of the primary dataset by ensuring that the correlation of transformed pairwise distances with the original dataset is higher than some researcher-specified threshold. The scaling transformation has the appealing property of being interpretable as a feature selection: the higher the weight w_l, the more important the l-th coordinate is for alignment. Thus, by selecting the top coordinates by their weights, we can access the genes most important for aligning the modalities.
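As a toy numerical illustration of this idea (synthetic data, not drawn from any dataset in the paper): when a secondary modality is driven by a subset of primary features, a scaling transform that upweights those features raises the Pearson correlation of pairwise squared distances between the two modalities.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                   # primary dataset: 50 cells x 5 features
Y = X[:, :2] + 0.1 * rng.normal(size=(50, 2))  # secondary modality driven by features 0-1

w = np.array([1.0, 1.0, 0.1, 0.1, 0.1])        # scaling transform upweighting features 0-1

# Pairwise squared Euclidean distances before and after the scaling transform
u_orig = pdist(X, "sqeuclidean")
u_trans = pdist(X * w, "sqeuclidean")
u_sec = pdist(Y, "sqeuclidean")

r_orig, _ = pearsonr(u_orig, u_sec)
r_trans, _ = pearsonr(u_trans, u_sec)
# Emphasizing the informative features improves the distance-correlation alignment
assert r_trans > r_orig
```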

Mathematical formulation
Suppose we have n observations across M datasets D_m, m = 1, 2, …, M, where D_m = { x_i^(m) : i = 1, 2, …, n } contains data (categorical or continuous) for each observation. We will refer to D_1 as the primary dataset and the rest as secondary. Each dataset's dimensionality and domain may vary; in particular, we assume D_1 is k-dimensional. Each dataset should also have some notion of distance between observations attached to it, which we will denote d^(m), so d_ij^(m) is the distance between observations i and j in D_m. Since our entire framework below deals in squared distances, for notational convenience we will let u_ij^(m) be the squared distance between points i and j in D_m; also, we drop the superscript (1) when referring to the primary dataset D_1 and its data.
The goal is to find a transformation T with T(D_1) generating a dataset D_* such that the Euclidean metric d_* on D_* aligns the various metrics d^(m), each informed by its respective modality. Non-Euclidean primary distance metrics d_* are allowed if they can be computed as a sum of k terms, one for each feature (e.g., Hamming distance). We emphasize that none of the secondary d^(m) need to be Euclidean. This setup is quite general, and we now specify the form of the transformation and the criteria for balancing information from the various metrics. Here, we limit T to a scaling transform. That is, T(D_1) = { S(w) x : x ∈ D_1 } for some w ∈ R^k, where S(w) is a k × k diagonal matrix with w as its diagonal entries. Then, the squared distance between points i and j under the transformation is given by

u*_ij = Σ_l ŵ_l (x_il − x_jl)²,

where ŵ is the element-wise square of w, i.e., ŵ_l = w_l². The scaling transform acts as a feature-weighting mechanism: it chooses the features of D_1 that align the datasets best (i.e., ŵ_l being large means that the l-th coordinate of D_1 is important). We note here that a natural extension would be allowing general linear transformations for T; however, in that context, the fast framework of quadratic programming would need to be substituted for the much slower framework of semidefinite programming.
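The key algebraic fact, sketched below on random data, is that the transformed squared distance equals a linear function of the squared weights ŵ; this linearity is what makes the quadratic-program formulation possible.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))            # 6 points, k = 4 features
w = rng.uniform(0.5, 2.0, size=4)      # scaling transform w
w_hat = w ** 2                          # element-wise square of w

i, j = 0, 3
diff_sq = (X[i] - X[j]) ** 2

# Squared distance after scaling, written as a linear function of w_hat ...
u_ij = np.sum(w_hat * diff_sq)
# ... equals the Euclidean squared distance between the scaled points
direct = np.sum((w * X[i] - w * X[j]) ** 2)
assert np.isclose(u_ij, direct)
```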
Here, our approach to integration between the metrics is to learn a metric d_* that aligns well with all of them. Our measure of the alignment between d_* and d^(m) is given by the Pearson correlation between pairwise squared distances under the two metrics,

Corr(u_*, u^(m)).  (1)

Intuitively, maximizing this correlation coefficient encourages distances under d_* to be large when the corresponding d^(m) distances are large, and vice versa. To deal with multiple modalities, we maximize a weighted sum of the correlations between u_* and the distances under each of the metrics, allowing the user to specify how much each modality should be weighted. We also allow a hard constraint, whereby the correlation between the pairwise distances in the transformed data and in the primary dataset is lower-bounded. Our goal is thus to

maximize_ŵ  Σ_m α_m Corr(u_*(ŵ), u^(m))   subject to   Corr(u_*(ŵ), u^(1)) ≥ s,  (2)

where the α_m and s are hyperparameters that determine the importance of the various metrics. We have also highlighted that u_* is a function of ŵ and is determined entirely by the solution to (2).
In the rest of our discussion, we will abuse notation and primarily use d, rather than d_*, to refer to the optimal metric. The machinery of quadratic programming makes this optimization feasible.

Setting up the quadratic program
As motivated above, quadratic programming (QP) is a framework for constrained convex optimization problems that allows a quadratic term in the objective function and linear constraints. The general form is

minimize_x  (1/2) xᵀ P x + qᵀ x   subject to   G x ≼ h,  A x = b,

for given P, q, G, h, A, and b, where P is a positive semidefinite (psd) matrix and the notation ≼ means the inequality is true for each coordinate (i.e., G x ≼ h means (G x)_i ≤ h_i for all i).
To put our optimization (2) in a QP formulation, we expand the covariance and variance terms in the definition of correlation in (1), and show that the covariance is linear in the transformation ŵ while the variance is quadratic in it:

Cov(u_*, u^(m)) = v_mᵀ ŵ,  (3)
Var(u_*) = ŵᵀ Q ŵ,  (4)

where each v_m is a k-dimensional vector that depends only on u^(m) and the primary dataset D_1; Q is a k × k matrix that depends only on D_1; and these quantities are computed over P, the set of pairs of observations, where |·| denotes set cardinality. It is also not hard to show that Q is positive semidefinite, as the QP framework requires. There is one more difficulty to address. The correlation is the quotient of the covariance and the standard deviation, and the QP framework cannot handle quotients or square roots. However, maximizing a quotient can be reframed as maximizing the numerator (the covariance), minimizing the denominator (the variance), or both.
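One concrete way to realize identities (3) and (4) on toy data (the constructions of v and Q below are a plausible instantiation consistent with the text, not necessarily Schema's exact matrices): stack the per-feature squared differences of all pairs into a matrix D, so that u_* = D ŵ; centering over pairs then exposes a covariance linear in ŵ and a variance quadratic in ŵ.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 3))                  # primary data: 8 points, k = 3
u_sec = rng.normal(size=(8 * 7) // 2)        # secondary squared distances (arbitrary here)
w_hat = rng.uniform(0.1, 1.0, size=3)

# Rows of D hold per-feature squared differences for each pair; u_* = D @ w_hat
D = np.array([(X[i] - X[j]) ** 2 for i, j in combinations(range(8), 2)])
u_star = D @ w_hat

# Center over the set of pairs to expose the covariance/variance structure
Dc = D - D.mean(axis=0)
u_sec_c = u_sec - u_sec.mean()

v = Dc.T @ u_sec_c / len(u_sec)              # covariance is linear in w_hat
Q = Dc.T @ Dc / len(u_sec)                   # variance is quadratic in w_hat (Q is psd)

assert np.isclose(v @ w_hat, np.cov(u_star, u_sec, bias=True)[0, 1])
assert np.isclose(w_hat @ Q @ w_hat, np.var(u_star))
```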
We now have the ingredients for the QP and can frame the optimization problem as

maximize_ŵ  Σ_m α_m v_mᵀ ŵ − λ ŵᵀ Q ŵ − γ ‖ŵ − 1‖²   subject to   ŵ ≥ 0,  (5)

where λ balances the variance penalty against the covariance terms, and γ is the hyperparameter for regularization of ŵ, which we want to penalize for being too far away from the all-ones vector 1 (i.e., equal weighting of all the features). One could also regularize the ℓ2 norm of ŵ alone (i.e., incorporate the term −γ‖ŵ‖² instead), which would encourage ŵ to be small; we have found that empirically the two choices yield similar results.
This program can be solved by standard QP solvers (see Supp Text 2 for the full details of how to put the above program in canonical form for a solver), and the solution ŵ* can be used to transform unseen input data via x ↦ S(w*) x for x ∈ R^k, where w* = √ŵ* element-wise.
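Schema uses standard QP solvers; purely as an illustrative stand-in, the sketch below optimizes the same covariance-minus-penalties objective (5) with a generic bounded optimizer on synthetic data (the values of λ and γ are arbitrary choices for this toy example).

```python
import numpy as np
from itertools import combinations
from scipy.optimize import minimize

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 4))
# Secondary squared distances driven by feature 0 only
u_sec = np.array([(X[i, 0] - X[j, 0]) ** 2 for i, j in combinations(range(30), 2)])

D = np.array([(X[i] - X[j]) ** 2 for i, j in combinations(range(30), 2)])
Dc = D - D.mean(axis=0)
u_c = u_sec - u_sec.mean()
v = Dc.T @ u_c / len(u_sec)        # linear (covariance) term
Q = Dc.T @ Dc / len(u_sec)         # quadratic (variance) term

lam, gamma = 0.1, 0.01             # illustrative hyperparameter values
ones = np.ones(4)

def objective(w_hat):
    # maximize covariance; penalize variance and distance from equal weights
    return -(v @ w_hat) + lam * w_hat @ Q @ w_hat + gamma * np.sum((w_hat - ones) ** 2)

res = minimize(objective, x0=ones, bounds=[(0, None)] * 4)
w_hat = res.x
# Feature 0, which drives the secondary modality, receives the largest weight
assert w_hat[0] == w_hat.max()
```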

Hyperparameters
A well-known challenge for machine learning algorithms is the interpretability of hyperparameters. Here, the QP solver needs values for α, λ, and γ, and specifying these in a principled way is a challenge for users. Our approach is thus to allow the user to specify more natural parameters. Specifically, the user can specify a minimum correlation s between the pairwise distances in D_* and the primary dataset D_1, i.e., Corr(u_*, u^(1)) ≥ s, as well as a bound q such that max_l ŵ_l ≤ q. The quantity q thus controls the maximum weight that any one feature can take.
While these quantities are not directly optimizable in our QP formulation (5), we can access them by varying the hyperparameters α, λ, and γ.
Intuitively, we note that the choice of γ controls whether ŵ satisfies the q constraint, while α and λ control whether the correlation constraint is satisfied. To satisfy these constraints, we simply grid search across feasible values of {α, λ, γ}: we solve the QP for fixed values of α, λ, and γ, keeping only the solutions for which the {s, q} constraints are satisfied. Of these, we choose the one with the best objective value. The efficiency of quadratic programming means that such a grid search is feasible, giving users the benefit of more easily interpretable and natural hyperparameters.

Recommendations for setting s and q
We recommend that only s (minimum correlation) and not q (maximum feature weight) be used to control Schema's optimization. The default value of q in our implementation is set very high (10³) so that it is not a binding constraint in most cases. We recommend not changing it, and in future versions of Schema we may reformulate the QP so that q is removed entirely. To limit the distortions in the primary modality, we recommend that s be set close to 1: the default setting of s is 0.99 and we recommend values ≥ 0.9. When Schema is used for feature selection, we recommend aggregating results across an ensemble of runs over a range of s values (a wide range is recommended here) to increase the robustness of the results.

Preprocessing transforms
Standard linear decompositions, such as PCA or NMF, are useful preprocessing steps for Schema. PCA is a good choice in this regard because it decomposes along directions of high variance; NMF is slower, but has the advantage of being designed for non-negative data (e.g., transcript counts). Since the transform that we generate can be interpreted as a feature-weighting mechanism, we can identify the directions (in PCA) or factors (in NMF) most relevant to aligning the datasets. Here the user can employ arbitrary feature sets including, for instance, a union of features from two standard methods (e.g., the set-union of PCA and CCA features) or those generated by another single-cell analysis method like MOFA+ 21.
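A minimal preprocessing sketch on toy count data (all dimensions chosen arbitrarily): PCA on log-transformed counts, or NMF directly on the non-negative counts, yields the low-dimensional representation handed to Schema, whose feature weights then rank the PCs or NMF factors.

```python
import numpy as np
from sklearn.decomposition import PCA, NMF

rng = np.random.default_rng(4)
counts = rng.poisson(2.0, size=(100, 30)).astype(float)  # toy count matrix

# PCA on log1p-transformed counts; NMF directly on the non-negative counts
pcs = PCA(n_components=10).fit_transform(np.log1p(counts))
factors = NMF(n_components=10, init="nndsvda", max_iter=500).fit_transform(counts)

# Either representation can serve as Schema's primary dataset; Schema's
# feature weights then rank the PCs or NMF factors by their alignment value
assert pcs.shape == (100, 10) and factors.shape == (100, 10)
assert np.all(factors >= 0)
```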

Motivating the choice of correlation as an objective
As a measure of the alignment between our transformation and a dataset, the correlation of pairwise distances is flexible and robust. An expanded version of the arguments in this paragraph is available in the appendices. Given a pair of datasets, there is a deep connection between their pairwise-distance Spearman rank correlation and the similarity of their neighborhood structures: if the correlation is greater than 1 − ε, the fraction of misaligned neighborhood relationships will be less than O(√ε). There is also a compelling manifold interpretation: assuming the high-dimensional data lies on a low-dimensional manifold, small Euclidean distances are more accurate than large distances, so the local neighborhood structure is worth preserving. We can show intuitively that optimizing the correlation aims to preserve local neighborhood structure.
Using correlation in the objective also affords the flexibility to broaden Corr(u_*, u^(m)) in (2) to any function of the metric, i.e., Corr(u_*, f ∘ u^(m)); this allows us to invert the direction of alignment or weigh local distances more heavily. As RNA-seq dataset sizes reach millions of cells, even calculating the O(n²) pairwise distances becomes infeasible. In this case, we sample a subset of the pairwise distances. The sample correlation is a robust estimator, allowing Schema to perform well even with relatively small subsets; in fact, we need only a sample size logarithmic in the desired confidence level to generate high-confidence results (Supp Text 3).
This enables Schema to continue scaling to more massive RNA-seq datasets.
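The subsampling argument can be illustrated numerically (synthetic data): the correlation estimated from a few thousand randomly sampled pairs closely tracks the correlation computed over all ~125,000 pairs.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 10))                  # "primary" data
Y = X[:, :3] + 0.2 * rng.normal(size=(500, 3))  # "secondary" data

u_x = pdist(X, "sqeuclidean")
u_y = pdist(Y, "sqeuclidean")

r_full, _ = pearsonr(u_x, u_y)                  # correlation over all pairs

# Estimate the same correlation from a small random subset of the pairs
idx = rng.choice(len(u_x), size=5000, replace=False)
r_sub, _ = pearsonr(u_x[idx], u_y[idx])
assert abs(r_full - r_sub) < 0.05
```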

Inference of cell types by synthesizing gene expression and chromatin accessibility
Applying the TruncatedSVD function in the Python library scikit-learn 54, we reduced the primary and secondary datasets to low-dimensional representations before supplying them to Schema. We performed canonical correlation analysis (CCA) on the same dimensionality-reduced primary and secondary datasets as supplied to Schema and computed 30 CCA factors, performing Leiden clustering using these. To implement the heuristic approach described by Cao et al. 6, we grouped the 11,296 cells into k=300 clusters by k-means clustering of RNA-seq data; results were robust to the choice of k. Each cluster ("pseudocell") was represented by the average ATAC-seq profile of its member cells, with these aggregated profiles forming the input to the Leiden clustering algorithm.
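The pseudocell heuristic can be sketched in a few lines (random stand-in matrices; k reduced for the toy example):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)
rna = rng.normal(size=(300, 20))       # RNA-seq features per cell
atac = rng.normal(size=(300, 15))      # ATAC-seq features per cell

# Group cells by k-means on the RNA profiles, then represent each cluster
# ("pseudocell") by the average ATAC-seq profile of its member cells
k = 30
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(rna)
pseudocells = np.vstack([atac[labels == c].mean(axis=0) for c in range(k)])
assert pseudocells.shape == (k, 15)
```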

Differential expression analysis while accounting for batch effects
This mouse gastrulation dataset was originally described by Pijuan-Sala et al. 57 and investigated by Argelaguet et al. 21,53 using the MOFA+ algorithm. We operated on the data as preprocessed and made available by them and, for the MOFA+ evaluation in this paper, also used their pretrained models.
We first reduced the RNA-seq data (primary modality) to its top 10 principal components. We configured Schema to use batch information as a secondary modality with weight −1 and developmental age information as a secondary modality with weight +1; thus, correlation with the former was minimized and with the latter maximized. The minimum correlation threshold was set to 0.9; we found that the results were robust to variations in this setting (0.8 and 0.95).
Since Schema accepts arbitrary distance measures on secondary datasets, we could investigate the impact of treating developmental timepoints as categories rather than as a time ordering. In the categorical distance metric, we defined two cells to be at distance 0 if they were at the same developmental timepoint and at distance 1 otherwise. In the time-ordering metric, we specified the first and third timepoints to be at distance 2 apart and the middle timepoint to be at distance 1 from either end. The two distance measures lead to different feature-selection results from Schema, reflecting the distinct underlying variations in expression profiles. With category-based distances, PC5 receives the highest weight, while PCs 4 and 6 are given higher weights in the time-ordering case (Figure 3B). This happens because the mean expression level of PC5 shows large variation across the three timepoints but does not change monotonically along the time course; in contrast, PCs 4 and 6 display expression profiles that change monotonically with developmental age (Figure 3C). Schema's flexibility to incorporate a distance measure that highlights specific variability patterns can thus enable researchers to identify precisely targeted gene sets.
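The two distance measures can be written down directly; the middle timepoint label E7.0 below is a hypothetical stand-in, since only E6.5 and E7.25 are named in the text.

```python
# ordinal positions of the three developmental timepoints (middle label assumed)
order = {"E6.5": 1, "E7.0": 2, "E7.25": 3}

def categorical_dist(a, b):
    # distance 0 for cells at the same timepoint, 1 otherwise
    return 0 if a == b else 1

def ordinal_dist(a, b):
    # time-ordering metric: the ends are 2 apart, the middle 1 from either end
    return abs(order[a] - order[b])

assert categorical_dist("E6.5", "E6.5") == 0
assert categorical_dist("E6.5", "E7.25") == 1
assert ordinal_dist("E6.5", "E7.25") == 2
assert ordinal_dist("E7.0", "E7.25") == 1
```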
To create a gene set from Schema's feature weighting, we selected the intersection of k.

We created batch-effect-adjusted benchmark gene sets by using different combinations of replicates. One can create a subset of the original dataset by sampling cells from only one of the two replicates at each timepoint; by iterating over all possible combinations of replicates, we created eight such subsets. These subsets differ in the batch information they contain but share the same developmental age information. Using the Wilcoxon rank-sum test in scanpy 36, we identified genes differentially expressed between the first (E6.5) and last (E7.25) stage in each subset and defined the benchmark gene set to consist of genes that are differentially expressed across a majority of the subsets. The benchmark set is thus robust to batch effects, being composed of genes whose differential expression stands out across different replicates (i.e., batches). By varying the thresholds of the test, we created benchmark sets of varying sizes and measured the overlap of the Schema and MOFA+ gene sets with these. The Schema gene set had a higher overlap, and for benchmark sets of all sizes its overlap with them was significant (hypergeometric test with Bonferroni correction, p = 5.9 × 10⁻¹² for the benchmark set of size 188 and Bonferroni-corrected p < 10⁻⁵ for benchmark sets of all sizes, Figure 3E); this was not the case for MOFA+.
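The significance of an overlap between a selected gene set and a benchmark set can be computed with a hypergeometric tail, as in this sketch (only the benchmark-set size 188 is taken from the text; the other counts are hypothetical):

```python
from scipy.stats import hypergeom

# Hypothetical sizes: N genes total, a benchmark set of size K (188, as in the
# text), a method-selected set of n genes, x of which overlap the benchmark
N, K, n, x = 10000, 188, 200, 30

# One-sided tail: probability of >= x overlapping genes under random selection
p = hypergeom.sf(x - 1, N, K, n)
assert p < 1e-10
```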

Spatially-informed differential expression on mouse brain Slide-seq
We used gene expression as the primary modality, while spatial density and cell type labels were the secondary modalities. We first computed spatial density information for each cell by learning a two-dimensional Gaussian-kernel density function on cell locations; it assigns higher scores to regions with denser cell packing (Figure 2C). We then ran Schema using the gene expression matrix as the primary dataset, with the secondary datasets consisting of the numeric kernel-density scores, as well as the categorical labels corresponding to the four most common non-granule cell types. We aimed to find a transformation of the primary data that maximized correlation with cell spatial density while preserving a high correlation with granule cell type labels. Additionally, differences in cell-type distribution between dense and sparse regions are a confounding factor when seeking to identify a gene set specific to the granule cell type. To mitigate this, we assigned a small negative weight to correlation with non-granule cell type labels in Schema's objective function. The primary dataset was pre-processed with a non-negative matrix factorization (NMF) transformation, limiting it to the top 100 NMF factors. Each Schema run consisted of multiple sub-runs over an ensemble of parameter settings, with the results averaged across these. The gene scores from each sub-run were a weighted average of the features, with each feature's weight computed as a softmax over w, the Schema-computed weights; cell loadings were computed similarly. This softmax approach is parameter-free and ensures that gene rankings are informed primarily by the features with the highest Schema weight.
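The spatial-density step can be sketched with scipy's `gaussian_kde` as a stand-in for the two-dimensional kernel density estimate (synthetic bead coordinates):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(7)
# Toy bead coordinates: a dense cluster plus a sparse uniform background
dense = rng.normal(loc=0.0, scale=0.5, size=(400, 2))
sparse = rng.uniform(-5, 5, size=(100, 2))
coords = np.vstack([dense, sparse])

# gaussian_kde expects variables in rows and observations in columns
kde = gaussian_kde(coords.T)
density = kde(coords.T)          # per-cell spatial density score

# Cells in the densely packed region receive higher scores on average
assert density[:400].mean() > density[400:].mean()
```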
To adapt CCA for a three-way modality synthesis, we tested two approaches: (1) combining spatial density and cell-type information into a composite measure that was then correlated with gene expression, or (2) performing two separate CCA analyses (correlating gene expression against either spatial density or cell type) and combining them. In the first CCA-based approach, we combined spatial density and cell-type labels by learning a Gaussian kernel density function only on cells labeled as granule cells and then inferring its value for the other cells.
This score was then used in CCA. In the second CCA-based approach, where we integrated results from two preliminary CCA runs, the combined cell loadings were computed as the average of the normalized cell loadings from the two CCAs, with the final gene scores then computed by a matching pursuit technique 45,46: the final CCA score of a gene was the dot product of the CCA cell loadings and the gene's expression vector. In our evaluations, the first CCA-based approach performed comparably to or worse than the second, and only the latter's results are presented in this paper.
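The loading-combination and gene-scoring steps of the second approach can be sketched as follows (random stand-in data; z-scoring is an assumed choice of normalization, which the text does not specify):

```python
import numpy as np

rng = np.random.default_rng(10)
expr = rng.normal(size=(100, 50))    # cells x genes expression matrix
load_a = rng.normal(size=100)        # cell loadings from CCA run 1 (vs. density)
load_b = rng.normal(size=100)        # cell loadings from CCA run 2 (vs. cell type)

def zscore(v):
    # assumed normalization: zero mean, unit variance
    return (v - v.mean()) / v.std()

# Combined cell loadings: average of the normalized per-run loadings
combined = (zscore(load_a) + zscore(load_b)) / 2

# Final score per gene: dot product of cell loadings with the gene's expression
gene_scores = expr.T @ combined
assert gene_scores.shape == (50,)
```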
We also needed to adapt SpatialDE and Trendsceek, both of which have unsupervised formulations, to select for genes whose expression shows spatial variation in granule cell types but not in non-granule cell types. To do so, we ran them separately on granule and non-granule cells and then ranked genes based on the difference of gene ranks between the two runs.

Genomic location and accessibility inform variability in expression
We featurized chromatin accessibility by computing levels of accessibility at different locations and window sizes relative to the transcription start and end sites of a gene. For example, the feature value rbf_20e3 for a gene was computed as follows: we first assigned weights to each ATAC-seq peak in the data, with a peak d base pairs upstream of the gene's transcription start site (TSS) given a weight that decays with d according to a radial basis function (RBF) profile of scale 20 × 10³, and then aggregated the weighted peaks. We used Schema to find a set of feature weights that maximized correlation between distances in the primary and secondary modalities. Each Schema run consisted of multiple sub-runs over an ensemble of parameter choices; the final feature weights were taken as the average across these sub-runs.
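A Gaussian radial basis function is one plausible reading of the rbf_* feature naming (the exact functional form is not given in the text); under that assumption, the peak weighting looks like:

```python
import numpy as np

def rbf_peak_weight(d, scale):
    # Gaussian decay with distance d (in bp) from the TSS; `scale` sets the
    # window size, e.g. 20e3 for the rbf_20e3 feature (assumed form)
    return np.exp(-(d ** 2) / (2 * scale ** 2))

d = np.array([0.0, 10e3, 20e3, 100e3])
w = rbf_peak_weight(d, 20e3)
assert w[0] == 1.0                 # a peak at the TSS gets full weight
assert np.all(np.diff(w) < 0)      # weight decays monotonically with distance
```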
To compute feature weights using ridge regression, we used the features in the primary modality as explanatory variables and summarized each gene's expression variability as the standard deviation of the corresponding expression vector in the secondary modality, setting this as the response variable. Schema and ridge regression both assign the highest weights to the rbf_10e6 and rbf_behind_10e6 features (Figure 3C and Supp Table 3). Compared to ridge regression, Schema offers the advantage of being able to correlate the primary modality with a multi-dimensional secondary modality rather than just a single summary statistic. Also, the constrained optimization framework that limits distortions of the primary modality acts as a regularization, enhancing the stability of Schema's results. To evaluate this stability, we divided genes into subsets (by their strand orientation and chromosome) and applied Schema and ridge regression on each subset independently. We then computed the standard error in the feature-weight estimates and the corresponding t-statistic, allowing us to compare the two approaches (Supp Table 3).
To test the null hypothesis that the genes in the set were dispersed randomly across TADs, we formulated two independent tests: 1) the frequency of gene pairs that co-occupy a TAD is indistinguishable from random, and 2) given the count of genes in each TAD, the rate parameter of the exponential distribution fitted to these TAD occupancy counts is indistinguishable from random. By permutation tests where we randomized gene-TAD assignments, we were able to reject the null hypothesis for the top quartile of genes (i.e. the most variable) in both cases, with Bonferroni-corrected p < 0.004. The trend continues across the remaining three quartiles, with the TAD clustering scores for the bottom quartile being the lowest ( Figure 3D).
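The first permutation test can be sketched as follows (toy counts; in the real analysis the assignments come from gene-TAD annotations):

```python
import numpy as np

rng = np.random.default_rng(8)
n_genes, n_tads = 200, 50

def co_occupancy(assignment):
    # number of gene pairs that share a TAD
    counts = np.bincount(assignment, minlength=n_tads)
    return int(np.sum(counts * (counts - 1) // 2))

# "Observed" assignment: genes concentrated in a few TADs (clustered)
clustered = rng.choice(10, size=n_genes)
observed = co_occupancy(clustered)

# Null distribution: gene-TAD assignments randomized uniformly over all TADs
null = np.array([co_occupancy(rng.choice(n_tads, size=n_genes))
                 for _ in range(1000)])
p = (1 + np.sum(null >= observed)) / (1 + len(null))
assert p < 0.01                    # clustering is detected as non-random
```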

Schema reveals CDR3 segments crucial to T-cell receptor binding specificity
When estimating location-specific selection pressure (Figure 4C, D), we truncated CDR3 sequences to the first 20 residues (sequences longer than that constituted less than 0.2% of our dataset). The i-th element of the primary modality feature vector was the 1-letter code of the amino acid at the i-th sequence position, or a null value if the sequence was shorter than i.
We defined the distance between two sequences as the number of elements that disagreed. In the original space, this corresponds to the Hamming distance; in the transformed space, it is a location-weighted version of the Hamming distance. Each Schema run was an ensemble of sub-runs with varying parameter choices for the minimum correlation between the original and transformed datasets and the maximum allowed feature weights. Feature weights produced in each sub-run were normalized by linearly mapping the lowest weight to 0 and the highest to 1.
We then averaged these normalized feature weights across sub-runs. To compute a location's score using VDJdb, we extracted the VDJdb-provided relative entropy score (I.norm) for the location in each TCR motif and averaged it across all motifs in the database. Here, Schema and VDJdb scores have opposite orientations: for a location that demonstrates low variability, the associated Schema weight will be high while the VDJdb score will be low. Therefore, when comparing the Schema and VDJdb scores, we inverted the orientation of Schema scores by subtracting them from 1 (Figure 4C, D).
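The featurization and the location-weighted Hamming distance can be sketched as follows (assuming a 20-position featurization with "-" as the null value; the example sequences are hypothetical):

```python
import numpy as np

def featurize(seq, max_len=20):
    # truncate/pad a CDR3 sequence to a fixed-length residue vector;
    # "-" marks the null value for positions past the sequence end
    seq = seq[:max_len]
    return list(seq) + ["-"] * (max_len - len(seq))

def weighted_hamming(a, b, w):
    # location-weighted count of disagreeing positions
    return sum(wi for ai, bi, wi in zip(featurize(a), featurize(b), w) if ai != bi)

# With uniform weights this reduces to the plain Hamming distance
w_uniform = np.ones(20)
assert weighted_hamming("CASSL", "CASSV", w_uniform) == 1.0

# Down-weighting position 5 shrinks the contribution of that mismatch
w = np.ones(20)
w[4] = 0.2
assert weighted_hamming("CASSL", "CASSV", w) == 0.2
```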
The comparison of per-location scores between Schema and VDJdb is complicated by length differences between motifs in VDJdb and sequences in our dataset: the former contains shorter sequences, with the average sequence length of α and β chain motifs in VDJdb being 11.9 and 12.5, respectively; the corresponding averages in our dataset are 13.5 and 14.5. However, both datasets have good coverage of locations 1-10, and the per-location scores are in broad agreement there (Figure 4C, D).
To compute the selection pressure on amino acids, we focused on segments 3-7 in TCR α chains and 5-11 in TCR β chains, choosing these locations for their high sequence variability as estimated by Schema and VDJdb above. To compute Schema scores, an ensemble of sub-runs was performed, and as described above, Schema scores were normalized. VDJdb scores for an amino acid were computed as the average frequency-weighted relative entropy scores (height.I.norm) across the selected locations in all TCR motifs in the database.
To exemplify how Schema can synthesize additional modalities, we also incorporated proteomic measurements of 12 cell-surface markers. Hypothesizing that cell-surface protein levels should be unrelated to the V(D)J recombination variability, we added a low-weight term to Schema's objective function that penalized correlation between distances in the CDR3-sequence space and distances in proteomic-measurement space. Across subsets of the dataset split by donors (4 subsets) or by epitopes (10 randomly divided subsets), we compared the baseline two-modality setup against the new three-modality setup and found that the latter produced slightly more stable results than the former, with smaller standard deviations of Schema-computed weights across the subsets of data (0.094 vs 0.101 for the donor split, and 0.164 vs 0.166 for the epitope split). In general, we recommend the use of cross-validation or an independent metric to calibrate the relative weights of secondary modalities in such use cases.

Schema highlights secondary patterns while preserving primary structure
We chose gene expression as the primary modality, reducing it with non-negative matrix factorization (NMF) to the top 50 components, and used temporal metadata as the secondary modality. We estimated differential pseudotime using the implementation in Scanpy 36 of Haghverdi et al.'s 59 algorithm.

Schema can synthesize spliced and unspliced RNA counts to accentuate cell differentiation
We normalized the spliced and unspliced counts, log transformed them and reduced them to their top 100 principal components. These were specified to Schema as the primary and secondary modalities, respectively.
To construct a pseudotime estimate from Schema's output, we first computed the mean per-cell difference between the transformed and original RNA-seq data. Interpreting this difference as the major axis of transcriptional change, we projected the original RNA-seq values on it. The magnitude of projection for each cell is a score that we interpreted as a pseudotime measure.
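The projection-based pseudotime construction can be sketched in numpy (random stand-in for Schema's transformed output):

```python
import numpy as np

rng = np.random.default_rng(9)
original = rng.normal(size=(100, 10))       # original RNA-seq representation
shift = rng.normal(size=10)                 # a shared direction of change
transformed = original + 0.5 * shift        # stand-in for Schema's output

# Mean per-cell difference defines the major axis of transcriptional change
axis = (transformed - original).mean(axis=0)
axis = axis / np.linalg.norm(axis)

# Magnitude of each cell's projection on that axis serves as a pseudotime score
pseudotime = original @ axis
assert pseudotime.shape == (100,)
```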

Code Availability
Python source code, a tutorial, and examples demonstrating Schema's use are available at https://github.com/rs239/schema . The program is also available as the Python package "schema_learn," which can be installed with pip, the Python package installer.

Data Availability
We used the following publicly available datasets. Below, GEO refers to the Gene Expression Omnibus repository (https://www.ncbi.nlm.nih.gov/geo/):
• Slide-seq data from Rodriques et al. 10
o https://singlecell.broadinstitute.org/single_cell/study/SCP354/slide-seq-study
o Processing code at: https://github.com/broadchenf/Slideseq/tree/master/BeadSeq%20Code

Schema overview
(A) Schema is designed for assays where multiple modalities are simultaneously measured for each cell. The researcher designates one high-confidence modality as the primary (i.e., reference) and one or more of the remaining modalities as secondary. (B) Each modality's observations are mapped to points in a multi-dimensional space, with an associated distance metric that encapsulates modality-specific similarity between observations. Across the three graphs, the dashed and dotted lines indicate distances between the same pairs of observations. (C) Schema transforms the primary-modality space by scaling each dimension so that the distances in the transformed space have a higher (or lower, as desired) correlation with the corresponding distances in the secondary modalities; arbitrary distance metrics are allowed for the latter. Importantly, the transformation is provably guaranteed to limit the distortion of the original space, thus ensuring that information in the primary modality is preserved. (D) The new point locations represent information synthesized from multiple modalities into a coherent structure. To compute the transformation, Schema weights features in the primary modality by their importance to its objective; we have found this feature-selection aspect very useful in the biological interpretation of its results.

Inferring cell types by synthesizing gene expression and chromatin accessibility
(A) Leiden clustering 50 of cellular profiles results in greater ARI agreement with ground-truth cell type labels when featurizing cells by RNA-seq profiles alone compared to featurizing with ATAC-seq profiles alone. ATAC-seq does provide relatively more information when distinguishing PT cells. (B) Ground-truth labels from Cao et al. 6 (C, D) To assess the ground-truth accuracy of Leiden clustering, we assigned each cluster to the cell type most frequently seen in the ground-truth labels of its members. Clusters whose labels are more mixed will thus have lower accuracy. Clustering on RNA-seq profiles alone results in many PT cells assigned to such clusters. (E) Schema synthesis of RNA- and ATAC-seq features, followed by Leiden clustering, results in significantly greater concordance with ground-truth cell types than Leiden clustering on the RNA-seq features alone (one-sided binomial test, p = 6.7 x 10 -15 ). (F) ARIs of clusters from synthesized data are higher, especially for PT cells. Synthesizing the modalities using canonical correlation analysis (CCA) or a "pseudocell" approach described in the original study (see Methods) results in lower ARI scores.

Differential expression analysis while accounting for batch effects and developmental age
Weights computed by Schema after accounting for batch effects and developmental age with two different distance metrics, one that provides Schema with temporal ordering and another that does not. When incorporating order information, Schema down-weights PC5, which shows substantial within-timepoint, batch-related variability, and up-weights PC9, which has a higher correlation with time. The correspondingly identified PCs reflect the effects of these metrics. (D, E) Schema identifies genes with monotonically changing expression. For each gene identified by Schema or MOFA+, we regressed its expression (normalized to zero mean and unit standard deviation) against developmental time, encoding stages E6.25, E7.0 and E7.25 as timepoints 1, 2 and 3, respectively. Consistent with stage-dependent monotonicity in expression, the fitted slopes for Schema genes were significantly different from zero (two-sided t-test, p = 3.83 x 10 -6 ); this was not true of MOFA+ (p = 0.77). (F) Schema has stronger overlap with batch-effect-adjusted benchmark sets of differentially expressed genes (hypergeometric test with Bonferroni correction, p = 5.9 x 10 -12 for the benchmark set of size 188).

Estimating evolutionary pressure on T-cell receptor CDR3 sequences
(A) We analyzed a multi-modal dataset from 10x Genomics 5 to understand how a T-cell receptor's binding specificity relates to the sequence variability in the CDR3 regions of its α and β chains. The primary modality consisted of CDR3 peptide sequence data, which we correlated with the secondary modality, the binding specificity of the cell against a panel of 44 epitopes. We optionally synthesized an additional modality, proteomic measurements of 12 cell-surface marker proteins, as a use-case of incorporating additional information (Methods). (B) We performed two Schema analyses: (B.1) To infer location-wise selection pressure, the feature vector of the primary modality was the CDR3 sequence, with the Hamming distance between two sequences as the metric. (B.2) The second analysis aimed to understand amino acid selection pressure at locations that showed high variability. For each such location, a one-hot encoding of the amino acid at the location was used as the feature vector; we performed an ensemble of Schema runs across these locations. (C, D) Schema identifies sequence locations 3-9 (α chain) and 5-12 (β chain) as regions where sequences can vary with a comparatively modest impact on binding specificity. We compared Schema's scores to statistics computed from motifs in VDJdb. Here, we have inverted the orientation of Schema's weights to align them with the direction of VDJdb weights. (E) Schema and VDJdb agree on the relative importance of amino acids in preserving binding specificity (Spearman rank correlation = 0.74, two-sided t-test p = 2 x 10 -4 ). The low weight assigned to Cysteine is likely due to its infrequent occurrence, making the estimate unreliable.

Schema's results are in agreement with the RNA velocity tool, scVelo
By measuring each cell's Schema transformation, we computed a pseudotime estimate, which we found to be significantly correlated with scVelo's latent-time estimate (Spearman rank correlation = 0.716, two-sided t-test p < 10 -128 ). (E-G) Same t-SNE visualizations as above, but with cells colored by their scVelo latent time, showing that Schema places cells at similar differentiation stages progressively closer.
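The two featurizations used in the CDR3 analysis above (Hamming distance over peptide sequences, and a per-location one-hot encoding of amino acids) can be illustrated concretely; the example sequences and helper names below are hypothetical.

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def hamming(seq_a, seq_b):
    """Hamming distance between two equal-length CDR3 sequences:
    the number of positions at which they differ."""
    assert len(seq_a) == len(seq_b)
    return sum(a != b for a, b in zip(seq_a, seq_b))

def one_hot_position(seq, pos):
    """20-dimensional one-hot encoding of the amino acid at one location,
    as used for the per-location selection-pressure analysis."""
    v = np.zeros(len(AA))
    v[AA.index(seq[pos])] = 1.0
    return v

# two toy CDR3 sequences differing at a single position
d = hamming("CASSLGQAYEQYF", "CASSLGTAYEQYF")
enc = one_hot_position("CASSLGQAYEQYF", 6)  # encodes the 'Q' at position 6
```

In the first analysis each full sequence is a feature vector under the Hamming metric; in the second, an ensemble of Schema runs uses one such one-hot vector per high-variability location.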

Investigation of CCA and Schema cell loadings:
We sorted beads by their loading on the Schema-implied gene scores and investigated how the exposure to the secondary modalities (cell-type labels and spatial density) varied in this ordering; we repeated the analysis for CCA cell loadings. For k = 1, ..., 99, we selected cells with loadings in the percentile range [k, 100] and computed the frequency of granule-cell labels and the average Gaussian kernel density score of a cell in this set; higher values of these measures indicate stronger agreement with the cell-type and spatial-density modalities, respectively. In the plot, the size of a point is proportional to k. For both Schema and CCA, higher cell loadings typically correlate with a higher granule-cell fraction and higher kernel density, as both methods transform the primary gene-expression modality to align it with the secondary modalities. However, for Schema this relationship plateaus after a point because Schema's regularization mechanism limits the distortion of the primary modality, constraining the extent of match with the secondary modalities. In contrast, the unconstrained framework of CCA produces loadings where the 99th-percentile cell loading has significantly higher spatial-density exposure than the 95th-percentile loading. This may lead to overfitting, as the gene rankings computed by CCA are overly determined by sample-specific artifacts. In contrast, Schema's regularization mechanism produces gene rankings that are better preserved across samples.
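The percentile-sweep analysis described above can be sketched as follows on synthetic data; `exposure_by_loading_percentile` is a hypothetical helper, and the toy loadings, labels, and densities stand in for the real Slide-seq modalities.

```python
import numpy as np

def exposure_by_loading_percentile(loadings, is_granule, density, ks=range(1, 100)):
    """For each percentile cutoff k, restrict to cells with loadings in the
    [k, 100] percentile range and report (granule-cell fraction, mean
    spatial density) within that subset."""
    out = []
    for k in ks:
        cutoff = np.percentile(loadings, k)
        sel = loadings >= cutoff
        out.append((is_granule[sel].mean(), density[sel].mean()))
    return np.array(out)

rng = np.random.default_rng(0)
loadings = rng.normal(size=1000)
# toy secondary modalities loosely tied to the loadings
is_granule = (loadings + 0.5 * rng.normal(size=1000)) > 0
density = loadings + 0.2 * rng.normal(size=1000)
curve = exposure_by_loading_percentile(loadings, is_granule, density)
```

Plotting `curve` against k reproduces the qualitative picture in the text: both exposure measures rise with the loading percentile, and the shape of that rise distinguishes a regularized fit (plateau) from an unconstrained one (continued steep growth at the top percentiles).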
Supp Figure 9 Differential expression analysis while accounting for batch effects and developmental stage: Schema feature-selection results for different weights of the developmental-stage and batch-effect modalities.
The middle column is the one shown in the main text: equal (and opposite) weights for the batch-effect and developmental-stage modalities. The left-most column corresponds to using only the batch-effect modality, while the right-most column corresponds to using only the developmental-stage modality.
Supp Figure 10 Evaluation of canonical correlation analysis (CCA) performance (also see Supp Figures 7 and 8) (A) Inferring cell types by synthesizing RNA-seq and ATAC-seq data. The metric of evaluation here is the agreement between Leiden clustering on the synthesized dataset and ground-truth cell-type labels, measured using the adjusted Rand index (ARI), with higher scores indicating greater agreement. This panel contains the same information as Figure 2E and is reproduced here for convenience. (B) Inferring RNA velocity by synthesizing spliced and unspliced mRNA counts. Spliced and unspliced data were correlated using CCA, and the synthesized data were visualized with t-SNE. The bottom half of the panel colors cells by their scVelo latent-time estimate. This panel should be compared with Figure 6B-F, which displays the corresponding plots produced by Schema synthesis of the data. CCA's synthesis does not place cells at similar stages of differentiation as closely together as Schema does. The Spearman rank correlation between t-SNE distances and latent-time differences is 0.163 for CCA, while it is 0.397 when using just the spliced mRNA counts and 0.432 for the Schema transformation corresponding to a minimum correlation constraint of 0.95. (C) Schema highlights secondary patterns while preserving primary structure. RNA-seq data was synthesized with cell-age metadata using CCA. Compared to a synthesis by Schema (Figure 5B-D), the CCA-based visualization less clearly communicates the developmental trajectory. We quantified the age-related structure in the transformed dataset by a diffusion pseudotime analysis. The Spearman rank correlation between the pseudotime estimate and the ground-truth cell age is only 0.059 for the CCA-synthesized data, while it is 0.365 in the original, untransformed dataset and 0.436 in the Schema-transformed dataset corresponding to a minimum correlation constraint of 0.99.
Supp Table 3 Comparison of feature weights estimated by Schema and ridge regression when integrating chromatin accessibility and gene expression data. Feature weights estimated by Schema are always non-negative, indicating only the feature's importance in determining a gene's expression variability. In regression, feature weights can be positive or negative, indicating whether the feature is relevant for high- or low-variability genes. Hence, when interpreting the ridge regression output, we considered only the magnitude of a feature's weight and not its sign. To compare the robustness of Schema and ridge regression estimates, we ran both algorithms on subsets of genes bucketed by their strand orientation and chromosome. The regularization mechanism of Schema helps produce estimates that are more stable. The standard error and t-statistic (i.e., the ratio of Wt to Std Err) were computed from these subsets.
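As a sketch of the ridge-regression side of this comparison (synthetic data, not the paper's actual features), a closed-form ridge fit and an across-subset stability measure might look like the following; both helper names are hypothetical.

```python
import numpy as np

def ridge_weights(X, y, lam=1.0):
    """Closed-form ridge regression: w = (X^T X + lam * I)^-1 X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def stability_across_subsets(X, y, groups, lam=1.0):
    """Fit ridge separately on each subset (e.g., genes bucketed by
    chromosome) and return the per-feature standard deviation of |weight|
    across subsets, mirroring the robustness comparison in the text."""
    mags = np.array([np.abs(ridge_weights(X[groups == g], y[groups == g], lam))
                     for g in np.unique(groups)])
    return mags.std(axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=300)
groups = rng.integers(0, 3, size=300)   # stand-in for chromosome buckets
spread = stability_across_subsets(X, y, groups)
```

Taking absolute values before comparing reflects the text's point that only the magnitude, not the sign, of a ridge weight is comparable to Schema's non-negative weights.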

Lastly, for the right side of the inequality constraint, each coordinate of h is set to the corresponding constraint bound. We have no equality constraints in our optimization, so A and b from (2) are not needed.

Theoretical analysis of Schema's scalability: concentration bounds
Our approach is to show that, given an estimator M̂ calculated from a random sample, the correlation coefficient between all pairwise distances cannot be too different from the correlation coefficient computed on the sample. To do this, we apply Chernoff bounds, which control how far a random variable can be from its expectation, to the covariance and variance terms of the correlation coefficient. This gives us a bound on how far the correlation coefficient on the whole population can be from the one calculated on the sample.
Let S be a random subset of all possible interactions. For now, we assume that interactions are chosen uniformly at random. Solving the optimization problem from the above section (Equation (1)) on our sample yields M̂, an estimator for the true optimal transform M*.
We show that M̂ approximates M* well by showing that the pairwise distances under M̂ have high correlations with the secondary datasets as long as M̂ has high correlations on the subsample.
Formally, we will guarantee, for any ε, δ > 0, that a sample size |S| logarithmic in 1/δ suffices for the population correlation to be within ε of the sample correlation with probability at least 1 − δ. This is a powerful result, made possible by our restriction to scaling transforms, which are easy to analyze. First of all, note that we need a sample size only logarithmic in our desired confidence level in order to get strong concentration, allowing analysis of massive RNA-seq datasets. To begin our analysis, let M be a p × p psd matrix (in our specific case it will be diagonal, but the analysis generalizes to any psd matrix, which motivates the generalization to all psd matrices in future work). We also assume we randomly draw pairwise differences z = x_i − x_j, choosing the pairs (x_i, x_j) uniformly from the set of pairs of points in our primary dataset.
Here, we focus on the correlation between the transformed dataset and the primary dataset; the analyses for correlations between the transformed data and the secondary datasets are similar.
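The concentration claim can be illustrated empirically with a toy simulation: the correlation between scaled and unscaled pairwise distances computed on a uniform subsample of pairs lands close to the value computed on all pairs. The dataset, diagonal transform `w`, and sample sizes below are arbitrary stand-ins chosen for illustration.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))        # primary dataset
w = rng.random(5) + 0.5              # diagonal scaling transform (stand-in for M)
pairs = list(combinations(range(200), 2))

def dists(idx_pairs, scale):
    """Euclidean distances for the given index pairs under a per-feature scaling."""
    return np.array([np.linalg.norm(scale * (X[i] - X[j])) for i, j in idx_pairs])

# "population" correlation over all pairwise distances
full = np.corrcoef(dists(pairs, w), dists(pairs, np.ones(5)))[0, 1]

# sample correlation over a uniformly drawn subsample of pairs
sample_idx = rng.choice(len(pairs), size=2000, replace=False)
sample_pairs = [pairs[i] for i in sample_idx]
est = np.corrcoef(dists(sample_pairs, w), dists(sample_pairs, np.ones(5)))[0, 1]
gap = abs(full - est)
```

The gap between the two correlation estimates is small even though the subsample covers only a fraction of the roughly 20,000 pairs, which is the behavior the Chernoff-bound analysis formalizes.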
Consider the form of the (population) correlation in (3). If, for our samples, we can determine confidence intervals of size 2ε for each of its covariance and variance terms, then we can bound how far the sample correlation can be from the correlation on the entire set of pairwise distances. This distance is maximized when the variance terms are as small as possible and the covariance term is as large as possible. So, by expanding using Taylor expansions and dropping higher-order terms, we obtain, for a desired overall confidence level δ, the relationship between the sample size and ε given in (4). This can be converted into a (one-sided) confidence interval of length ε by substituting the probability on the left with the desired confidence level δ and solving for the sample size. The resulting bound depends on ||M||, the matrix norm, i.e., ||M|| = √tr(MᵀM); so, for a diagonal matrix, ||M||² = Σᵢ Mᵢᵢ². We can also bound ||z|| ≤ max_{i,j} ||x_i − x_j|| ≡ diam(X), the diameter of the dataset.
To get a confidence interval of size ε, we plug this into (4) and solve for the required sample size. Note that the diameter is an extremely coarse bound here. Morally, one can replace "diameter" with "variance," and the user has control over ||M|| through the choice of hyperparameters. We also note that the sample complexity improves drastically if we focus only on local distances, a future area of exploration.
The same analysis can be used for the other terms in (3), but their dependency on the diameter is not as bad, so this term is the worst case. Now, we consider the variance terms. For the first variance term, note that its deviation from its expectation is bounded by the maximum squared distance in the dataset, diam²(X), so we can use the Hoeffding inequality from above in the same way.
The second variance term takes the same form as above, but with the transformed distances in place of the original ones. As noted in (5), this is a bounded random variable as well, here with bound ||M||² diam⁴(X).
Thus, in order to get a uniform confidence interval across all the terms, we require the sample size to satisfy the most stringent of the individual bounds above.

Supp Text 4

Enriched functions and pathways for differentially expressed genes in dense granule cells
The genes identified by Schema as differentially expressed in densely packed granule cells are strongly enriched for signal-transmission functions, potentially indicating greater neurotransmission activity within these cells.