Abstract
Data integration of single-cell measurements is critical for our understanding of cell development and disease, but the lack of correspondence between different types of single-cell measurements makes such efforts challenging. Several unsupervised algorithms are capable of aligning heterogeneous types of single-cell measurements in a shared space, enabling the creation of mappings between single cells in different data modalities. We present Single-Cell alignment using Optimal Transport (SCOT), an unsupervised learning algorithm that uses Gromov Wasserstein-based optimal transport to align single-cell multi-omics datasets. SCOT calculates a probabilistic coupling matrix that matches cells across two datasets. The optimization uses k-nearest neighbor graphs, thus preserving the local geometry of the data. We use the resulting coupling matrix to project one single-cell dataset onto another via a barycentric projection. We compare the alignment performance of SCOT with state-of-the-art algorithms on three simulated and two real datasets. Our results demonstrate that SCOT yields results that are comparable in quality to those of competing methods, but SCOT is significantly faster and requires tuning fewer hyperparameters. The code is available at https://github.com/rsinghlab/SCOT
1 Introduction
Single-cell measurements provide a fine-grained view of the heterogeneous landscape of cells in a sample, revealing distinct subpopulations and their developmental and regulatory trajectories across time. The availability of measurements capturing various properties of the genome, such as gene expression, chromatin accessibility, DNA methylation, histone modifications, and chromatin 3D conformation, has increased the need for data integration methods capable of combining these disparate data types.
Despite the importance of this task, the heterogeneity among single cells presents unique challenges. For example, due to technical limitations, it is hard to obtain multiple types of measurements from the same individual cell. Furthermore, when we measure very different properties of a cell—such as its transcriptional and 3D chromatin profiles—we cannot a priori identify correspondences between features in the two domains. Accordingly, integrating two or more single-cell data modalities requires methods that do not rely on either common cells or common features across the data types. This property of the data prevents the application of some existing single-cell alignment methods because they require some correspondence information, either among the cells or the features [1–4].
Some approaches have tried to align datasets in an entirely unsupervised fashion. One of the earliest attempts, the joint Laplacian manifold alignment (JLMA) algorithm, constructs eigenvector projections based on local k-nearest neighbor graph Laplacians of the data [5]. The generalized unsupervised manifold alignment (GUMA) [6] algorithm seeks a one-to-one correspondence between two datasets based on a local geometry matching term. Liu et al. [7] showed that these methods do not perform well on the single-cell alignment task. Specifically, the GUMA implementation was non-trivial to run, and JLMA gave poor performance and did not scale well to larger values of k.
Liu et al. [7] proposed a manifold alignment algorithm based on the maximum mean discrepancy (MMD) measure, called MMD-MA, which can integrate different types of single-cell measurements. Another method, UnionCom [8], performs unsupervised topological alignment for single-cell multi-omics data. MMD-MA aims to match the global distributions of the datasets in a shared latent space, whereas UnionCom emphasizes learning both local and global alignments between the two distributions. Neither method requires any correspondence information either among samples or among the features of the different datasets. The papers demonstrate the state-of-the-art performance of the algorithms on simulated and real-world datasets. Although these results are encouraging, both MMD-MA and UnionCom require that the user specify four hyperparameters. In practice, selecting these hyperparameter values can be difficult and time-consuming in an unsupervised setting.
A growing number of applications across different research areas [9,10] are using optimal transport to learn a mapping between different data distributions. Optimal transport finds the most cost-effective way to move data points from one domain to another. One way to think about optimal transport is as the problem of moving a pile of sand to fill in a hole with the least amount of work. The optimal transport framework has also been used in biological applications. Schiebinger et al. [11] use optimal transport to study how gene expression changes over time; they use regularized unbalanced optimal transport to compute differences in gene expression from one time point to the next. ImageAEOT [12] maps single-cell images to a common latent space through an autoencoder and then uses optimal transport to track cell trajectories. Its related work [13] uses autoencoders and optimal transport to learn transport maps between multiple domains. However, the application of this method to single-cell datasets requires some form of supervision, like class labels, to preserve the underlying structure during transport.
The classical optimal transport method requires datasets to be in the same metric space and is hard to implement for domains in different dimensions. Mémoli et al. [14] generalized optimal transport by using the Gromov-Wasserstein distance, which compares metric spaces directly instead of comparing samples across spaces. In the natural language processing community, Alvarez et al. [10] used this approach to measure similarities between pairs of words across languages. They created uniform probability distributions on words in each language and used Gromov Wasserstein-based optimal transport to compute the distances between languages. As far as we are aware, the only biological application of Gromov-Wasserstein optimal transport comes from Nitzan et al. [15], who use it to reconstruct the spatial organization of cells from transcriptional profiles. This approach assumes that the data consists of cells that were originally connected in tissue and that closer cells share similar transcriptional profiles, but that the original spatial context and relationships among cells have been lost. With this setup, Nitzan et al. [15] use Gromov-Wasserstein optimal transport to map the cells to physical locations that preserve distances in the expression space.
In this paper, we present Single-Cell alignment using Optimal Transport (SCOT), an unsupervised learning algorithm that uses Gromov Wasserstein-based optimal transport to align single-cell multi-omics datasets (presented schematically in Figure 1). Like UnionCom, SCOT aims to preserve local geometry when aligning single-cell data. The algorithm achieves this by constructing a k-nearest neighbor graph for each dataset. SCOT then finds a probabilistic coupling between the samples of each dataset that minimizes the distance between the graph distance matrices produced by the k-NN graph. Finally, it uses the coupling matrix to project one single-cell dataset onto another through barycentric projection, thus aligning them. Unlike MMD-MA and UnionCom, our algorithm requires tuning of only two hyperparameters and is robust to the choice of one. We compare the alignment performance of SCOT with MMD-MA and UnionCom on three simulated and two real-world datasets. We demonstrate that SCOT aligns all the datasets as well as the state-of-the-art methods and converges ~15 and ~50 times faster than MMD-MA and UnionCom, respectively.
2 Method
SCOT uses Gromov Wasserstein-based optimal transport, which preserves local neighborhood geometry when moving data points. The output of this transport problem is a matrix of probabilities that represent how likely it is that data points from one space correspond to data points in the other space. These probabilities can then be used to project the data into the same space for alignment. In this section, we first introduce the formulation of optimal transport followed by its extension using the Gromov-Wasserstein distance. Finally, we present the details of our SCOT algorithm.
Without loss of generality, we present the case for two datasets. Let the two sets of points be $X = (x_1, x_2, \dots, x_{n_x})$ from $\mathcal{X}$ and $Y = (y_1, y_2, \dots, y_{n_y})$ from $\mathcal{Y}$. The datasets have $n_x$ and $n_y$ points, respectively. We do not require any correspondence information across the datasets but assume that there is some underlying shared structure so that the datasets can be aligned.
Optimal Transport
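The Kantorovich optimal transport problem seeks to find a minimal cost mapping between two probability distributions [16]. Referring back to the problem of moving a sand pile to fill in a hole, Kantorovich optimal transport allows us to split the mass of a grain of sand instead of moving the whole grain. For probability measures μ and ν defined on $\mathcal{X}$ and $\mathcal{Y}$, respectively, this optimal transport problem learns a minimal coupling π that attains

$$\inf_{\pi \in \Pi(\mu, \nu)} \int_{\mathcal{X} \times \mathcal{Y}} c(x, y) \, d\pi(x, y), \tag{1}$$

where c(x, y) is a cost function and Π(μ, ν) is the set of couplings of μ and ν given by

$$\Pi(\mu, \nu) = \{ \pi : \pi(A \times \mathcal{Y}) = \mu(A) \text{ and } \pi(\mathcal{X} \times B) = \nu(B) \text{ for all measurable } A \subseteq \mathcal{X},\ B \subseteq \mathcal{Y} \}. \tag{2}$$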
Intuitively, the cost function says how many resources it will take to move x to y, and the coupling π assigns a probability that x should be moved to y for each x and y in the two spaces. Note that when the spaces of interest are both the same metric space with set $\Omega$, distance d, and cost function $c(x, y) = d(x, y)^p$, then the optimal transport distance (Equation 1) is equivalent to the p-th Wasserstein distance:

$$W_p(\mu, \nu)^p = \inf_{\pi \in \Pi(\mu, \nu)} \int_{\Omega \times \Omega} d(x, y)^p \, d\pi(x, y). \tag{3}$$
Wasserstein distances measure the distances between probability distributions on a metric space and are commonly used in machine learning applications.
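Since we want to align data, we work with discrete measures p and q over our data points, which we can write as $p = \sum_{i=1}^{n_x} p_i \delta_{x_i}$ and $q = \sum_{j=1}^{n_y} q_j \delta_{y_j}$, where $\delta_{x_i}$ is the Dirac measure. Then, the cost function is given as a matrix $C \in \mathbb{R}^{n_x \times n_y}$, e.g. $C_{ij} = \|x_i - y_j\|$, and the set of possible couplings are the matrices

$$\Pi(p, q) = \{ \Gamma \in \mathbb{R}_{+}^{n_x \times n_y} : \Gamma \mathbb{1}_{n_y} = p, \ \Gamma^T \mathbb{1}_{n_x} = q \}. \tag{4}$$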
A discrete coupling Γ relates two measures p and q in a meaningful way: Each row $\Gamma_i$ tells us how to split the mass of data point $x_i$ onto the points $y_j$ for $j = 1, \dots, n_y$, and the condition $\Gamma \mathbb{1}_{n_y} = p$ requires that the sum of each row $\Gamma_i$ is equal to the probability of sample $x_i$. The discrete optimal transport problem attempts to find a coupling that minimizes the cost of moving samples through the linear program:

$$\min_{\Gamma \in \Pi(p, q)} \langle \Gamma, C \rangle = \min_{\Gamma \in \Pi(p, q)} \sum_{i, j} \Gamma_{ij} C_{ij}. \tag{5}$$
Although this problem can be solved with minimum cost flow solvers, it is usually regularized with entropy for more efficient optimization and empirically better results [17]. The addition of entropy diffuses the optimal coupling, meaning that more masses will be split. Thus, the optimal transport problem that is solved numerically is

$$\min_{\Gamma \in \Pi(p, q)} \langle \Gamma, C \rangle - \epsilon H(\Gamma), \tag{6}$$

where ϵ > 0 and H(Γ) is the entropy defined by

$$H(\Gamma) = -\sum_{i, j} \Gamma_{ij} \log \Gamma_{ij}. \tag{7}$$
Equation 6 is a strictly convex optimization problem, and for some unknown vectors $u \in \mathbb{R}^{n_x}$ and $v \in \mathbb{R}^{n_y}$, the solution has the form Γ* = diag(u)Kdiag(v), with $K = \exp(-C/\epsilon)$, element-wise. This solution can be obtained efficiently via Sinkhorn’s algorithm, which iteratively computes

$$u \leftarrow p \oslash (K v), \qquad v \leftarrow q \oslash (K^T u), \tag{8}$$

where ⊘ denotes element-wise division. This derivation immediately follows from solving the corresponding dual problem for Equation 6 [16].
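The Sinkhorn iterations described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used by SCOT: the function name `sinkhorn`, the fixed iteration count, and the toy data are all hypothetical, and no log-domain stabilization is applied, so it is only suitable for moderate values of ϵ.

```python
import numpy as np


def sinkhorn(p, q, C, eps, n_iter=2000):
    """Entropically regularized OT: returns diag(u) K diag(v)."""
    K = np.exp(-C / eps)           # Gibbs kernel, element-wise
    v = np.ones_like(q)
    for _ in range(n_iter):
        u = p / (K @ v)            # rescale rows to match p
        v = q / (K.T @ u)          # rescale columns to match q
    return u[:, None] * K * v[None, :]


# Toy example (hypothetical data): 5 and 7 points in the same 2-D space.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))
y = rng.normal(size=(7, 2))
C = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)  # C_ij = ||x_i - y_j||
p = np.full(5, 1 / 5)              # uniform marginals
q = np.full(7, 1 / 7)
gamma = sinkhorn(p, q, C, eps=1.0)
```

Each row of `gamma` sums (approximately) to the corresponding entry of `p` and each column to the corresponding entry of `q`, so the result is a valid coupling in the sense defined above.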
Gromov-Wasserstein distance
Classic optimal transport requires defining a cost function across domains, which can be difficult to implement when the domains are in different dimensions. Gromov-Wasserstein distance extends optimal transport by comparing distances between samples rather than directly comparing the samples themselves [10]. For this extension we need to assume we have metric measure spaces $(\mathcal{X}, d_x, \mu)$ and $(\mathcal{Y}, d_y, \nu)$, where $d_x$ and $d_y$ are distances on $\mathcal{X}$ and $\mathcal{Y}$, respectively [14]. Instead of defining a cost function between spaces as in classic optimal transport, Gromov-Wasserstein uses the difference between pairwise distances. Specifically, given a cost function $L : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$, the Gromov-Wasserstein distance between μ and ν is defined by

$$GW(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \int_{\mathcal{X} \times \mathcal{Y}} \int_{\mathcal{X} \times \mathcal{Y}} L\big(d_x(x_1, x_2), d_y(y_1, y_2)\big) \, d\pi(x_1, y_1) \, d\pi(x_2, y_2). \tag{9}$$
The main change from basic optimal transport (Equation 1) to Gromov-Wasserstein (Equation 9) is that we consider the effect of transporting pairs of points rather than single points. Intuitively, L(dx(x1, x2), dy(y1, y2)) now captures how transporting x1 onto y1 and x2 onto y2 would distort the original distances between x1 and x2 and between y1 and y2. This change ensures that the optimal transport plan π will preserve some local geometry. In the case of $L = L_p$, i.e. $L(a, b) = |a - b|^p$, Gromov-Wasserstein defines a distance on the space of metric measure spaces [14].
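For the discrete case, we can compute pairwise distance matrices $D_x$ and $D_y$ as well as the fourth-order tensor $\mathbf{L} \in \mathbb{R}^{n_x \times n_y \times n_x \times n_y}$, where $\mathbf{L}_{ijkl} = L\big((D_x)_{ik}, (D_y)_{jl}\big)$. The discrete Gromov-Wasserstein problem is then defined by

$$GW(p, q) = \min_{\Gamma \in \Pi(p, q)} \sum_{i, j, k, l} \mathbf{L}_{ijkl} \Gamma_{ij} \Gamma_{kl}. \tag{10}$$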
For each tuple (xi, xk, yj, yl), we are computing the cost of altering the pairwise distances between xi and xk when splitting their masses to yj and yl by weighting them by Γij and Γkl, respectively. The summation can also be expressed as the inner product ⟨L(Dx, Dy) ⊗ Γ, Γ⟩. Equation 10 is now both nonlinear and non-convex and involves operations on a fourth-order tensor, including the tensor product L(Dx, Dy) ⊗ Γ, which costs $O(n_x^2 n_y^2)$ operations for a naive implementation. Peyré et al. show that for some choices of loss function this product can be computed in $O(n_x^2 n_y + n_x n_y^2)$ cost [18]. In particular, for the case L = L2, the inner product can be computed using

$$L(D_x, D_y) \otimes \Gamma = D_x^2 \, p \, \mathbb{1}_{n_y}^T + \mathbb{1}_{n_x} q^T (D_y^2)^T - 2 D_x \Gamma D_y^T, \tag{11}$$

where the squares of $D_x$ and $D_y$ are taken element-wise.
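As in the classical optimal transport case, the coupling matrix can then be efficiently computed for an entropically regularized optimization problem:

$$\min_{\Gamma \in \Pi(p, q)} \langle L(D_x, D_y) \otimes \Gamma, \Gamma \rangle - \epsilon H(\Gamma). \tag{12}$$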
Larger values of ϵ lead to an easier optimization problem but also lead to a denser coupling matrix, meaning that more data points exhibit significant correspondences with one another. Smaller values of ϵ lead to sparser solutions, meaning that the coupling matrix is more likely to find the correct one-to-one correspondences for datasets where there are one-to-one correspondences. However, a smaller ϵ also yields a harder (more non-convex) optimization problem [10].
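To make the cost saving for the squared loss concrete, the following sketch compares the naive fourth-order tensor contraction with the factored form for L = L2, that is, L(a, b) = (a - b)^2. It is an illustration only: the matrices are random toy data, and the coupling is simply the product coupling, chosen because it has the required marginals p and q.

```python
import numpy as np

rng = np.random.default_rng(0)
nx, ny = 4, 6

# Hypothetical symmetric "distance" matrices for the two domains.
Dx = rng.random((nx, nx))
Dx = (Dx + Dx.T) / 2
Dy = rng.random((ny, ny))
Dy = (Dy + Dy.T) / 2

p = np.full(nx, 1 / nx)
q = np.full(ny, 1 / ny)
gamma = np.outer(p, q)  # any coupling with marginals p and q works here

# Naive version: build L[i, j, k, l] = (Dx[i, k] - Dy[j, l])**2 explicitly,
# then contract over (k, l). This needs O(nx^2 * ny^2) time and memory.
L = (Dx[:, None, :, None] - Dy[None, :, None, :]) ** 2
naive = np.einsum("ijkl,kl->ij", L, gamma)

# Factored version: only matrix products, O(nx^2 * ny + nx * ny^2).
fast = (Dx**2 @ p)[:, None] + (Dy**2 @ q)[None, :] - 2 * Dx @ gamma @ Dy.T

# Objective value <L (x) Gamma, Gamma> from the factored product.
gw_cost = np.sum(fast * gamma)
```

The two arrays `naive` and `fast` agree up to floating-point error, which is the identity that makes the squared loss practical for datasets with hundreds or thousands of cells.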
Peyré et al. [18] propose using a projected gradient descent approach for optimization, where both the projection and the gradient are taken with respect to Kullback-Leibler divergence. These projections are computed via Sinkhorn iterations. Algorithm 1 presents the algorithm for L = L2.
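The iteration can be sketched as follows for the squared loss. This is a simplified illustration under stated assumptions rather than a reproduction of Algorithm 1: the function names, the fixed iteration counts, and the toy data are hypothetical, the inner Sinkhorn solver is not log-stabilized, and implementations differ in whether the cost passed to Sinkhorn carries an extra constant factor (which only rescales ϵ).

```python
import numpy as np


def sinkhorn(p, q, M, eps, n_iter=200):
    """Entropic OT for cost matrix M (no log-domain stabilization)."""
    K = np.exp(-M / eps)
    v = np.ones_like(q)
    for _ in range(n_iter):
        u = p / (K @ v)
        v = q / (K.T @ u)
    return u[:, None] * K * v[None, :]


def entropic_gw(Dx, Dy, p, q, eps, n_outer=50):
    """Alternate between linearizing the GW cost and a Sinkhorn projection."""
    # Terms of the squared-loss tensor product that do not depend on gamma.
    const = (Dx**2 @ p)[:, None] + (Dy**2 @ q)[None, :]
    gamma = np.outer(p, q)                      # start from the product coupling
    for _ in range(n_outer):
        M = const - 2 * Dx @ gamma @ Dy.T       # L(Dx, Dy) (x) gamma
        gamma = sinkhorn(p, q, M, eps)          # projection onto couplings
    return gamma


def normalized_distances(Z):
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    return D / D.max()


# Toy example (hypothetical data): curves in 2-D and 3-D with different sizes.
t = np.linspace(0, 1, 20)
s = np.linspace(0, 1, 25)
X = np.c_[t, t**2]
Y = np.c_[np.cos(s), np.sin(s), s]
p = np.full(len(X), 1 / len(X))
q = np.full(len(Y), 1 / len(Y))
gamma = entropic_gw(normalized_distances(X), normalized_distances(Y), p, q, eps=0.05)
```

Because the problem is non-convex, the coupling returned by such a scheme depends on the initialization and on ϵ; the sketch makes no claim about recovering the correct correspondence.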
Single-Cell alignment using Optimal Transport (SCOT)
Our method, SCOT, works as follows: First, we compute the pairwise distances on our data by using a geodesic distance as in [15]. To do this, we construct a k-nearest neighbor graph weighted by Euclidean distances within each dataset. Then we compute the shortest path distance on the graph between each pair of nodes. We set the distance of any unconnected nodes to be the maximum (finite) distance in the graph and rescale the resulting distance matrix by dividing by the maximum distance for numerical stability. Our approach is robust to the choice of k (Supplementary Section 1.4).
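The distance computation in this step can be sketched with NumPy alone. The sketch below is illustrative and makes several assumptions that the text does not specify: the function name is hypothetical, the k-nearest neighbor graph is treated as undirected, the data contain no duplicate points, and shortest paths are computed with a vectorized Floyd-Warshall pass (adequate for a few hundred points; a dedicated graph library would be preferable for larger datasets).

```python
import numpy as np


def knn_geodesic_distances(X, k):
    """Normalized shortest-path distances on a k-NN graph of the rows of X."""
    n = X.shape[0]
    eucl = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

    # Indices of the k nearest neighbours of each point (column 0 is the point itself).
    nbrs = np.argsort(eucl, axis=1)[:, 1:k + 1]

    # Weighted adjacency matrix: inf means "no edge".
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    rows = np.repeat(np.arange(n), k)
    cols = nbrs.ravel()
    G[rows, cols] = eucl[rows, cols]
    G = np.minimum(G, G.T)  # assumption: edges are undirected

    # Floyd-Warshall: allow paths through intermediate node m.
    for m in range(n):
        G = np.minimum(G, G[:, [m]] + G[[m], :])

    # Unconnected pairs get the maximum finite distance, then rescale to [0, 1].
    finite_max = G[np.isfinite(G)].max()
    G[~np.isfinite(G)] = finite_max
    return G / finite_max


# Toy example (hypothetical data): 30 points with 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
D = knn_geodesic_distances(X, k=5)
```

The resulting matrix is symmetric with a zero diagonal and a maximum entry of one, which matches the rescaling described above.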
Since we do not know the true distribution of the original datasets, we follow [10] and set p and q to be the uniform distributions on the data points. With these graph distance matrices and marginal distributions, we solve for the optimal coupling Γ which minimizes Equation 12. To implement this method, we use the Python Optimal Transport toolbox (https://pot.readthedocs.io/en/stable/) [19]. The Sinkhorn iterations can often be unstable for small values of ϵ due to division by K, so we use the log stabilized version of the Sinkhorn iterations as proposed by [20, 21].
One of the major advantages of using Gromov-Wasserstein to align datasets is that we end up with a coupling matrix Γ with a probabilistic interpretation. In particular, the entries of the normalized row $n_x \Gamma_i$ are the probabilities that the fixed data point $x_i$ corresponds to each $y_j$. However, to use the correspondence metrics previously used in the field to evaluate the alignment, we need to project the two datasets into the same space. The Procrustes approach proposed in [10] does not generalize to datasets with different feature and sample dimensions, so we use a barycentric projection:

$$x_i \mapsto \frac{1}{p_i} \sum_{j=1}^{n_y} \Gamma_{ij} y_j = n_x \sum_{j=1}^{n_y} \Gamma_{ij} y_j. \tag{13}$$
This barycentric projection of point xi is a weighted average of the yj’s, where the weight Γij is the probability of correspondence between xi and yj. This projection averages over all the points. Thus, it has a tendency to center the projected data onto the mean of the dataset it is being projected on. Figure 1 presents the schematic of the SCOT algorithm.
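A barycentric projection of this kind amounts to one row normalization and one matrix product. The sketch below is illustrative; the function name and the toy couplings are hypothetical. It also shows the centering tendency noted above: with a completely diffuse coupling, every projected point lands on the mean of the target dataset.

```python
import numpy as np


def barycentric_projection(gamma, Y):
    """Map each x_i to the gamma-weighted average of the rows of Y."""
    weights = gamma / gamma.sum(axis=1, keepdims=True)  # rows sum to one
    return weights @ Y


# Toy target dataset (hypothetical): three points in 2-D.
Y = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])

# A one-to-one coupling: x_0 -> y_2, x_1 -> y_0, x_2 -> y_1.
perm = np.array([2, 0, 1])
gamma_sharp = np.zeros((3, 3))
gamma_sharp[np.arange(3), perm] = 1 / 3
X_sharp = barycentric_projection(gamma_sharp, Y)

# A fully diffuse coupling: every x_i is matched equally to every y_j.
gamma_diffuse = np.full((3, 3), 1 / 9)
X_diffuse = barycentric_projection(gamma_diffuse, Y)
```

With the one-to-one coupling each projected point coincides with its matched target point, whereas the diffuse coupling collapses all projected points onto the mean of `Y`.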
3 Experimental Setup
Simulated datasets
We follow Liu et al. [7] and benchmark our method on three different simulation schemes. All three simulations contain two domains with 300 samples that have been projected non-linearly to 1000- and 2000-dimensional feature spaces, respectively. The first simulation is a bifurcated tree in two-dimensional space. The second simulation maps the branching structure onto a Swiss roll in three-dimensional space. The third simulation is a circular frustum in three-dimensional space (Supplementary Figure S1). Simulations are generated with known sample-wise correspondences, which are used to benchmark methods and evaluate their performance in recovering them. We Z-score normalize the features of all simulation datasets before running the alignment algorithms.
Single-cell multi-omics datasets
We use two sets of single-cell multi-omics data to demonstrate the applicability of our model to real-world biological datasets. Both datasets are generated by co-assays; thus, we have known cell-level correspondence information for use in benchmarking.
The first dataset is generated using the sc-GEM assay [22], which simultaneously profiles gene expression and DNA methylation. The dataset (Sequence Read Archive accession SRP077853) is derived from human somatic cell samples undergoing conversion to induced pluripotent stem cells (iPSCs). This dataset was also used by Cao et al. [8] to demonstrate the performance of their UnionCom algorithm. The data dimensions are 177 × 34 for the gene expression data and 177 × 27 for the DNA methylation data.
The second dataset is generated by SNARE-seq [23], which links chromatin accessibility with gene expression. The data (Gene Expression Omnibus accession GSE126074) is derived from a mixture of human cell lines: BJ, H1, K562, and GM12878. We pre-processed the datasets following Chen et al. [23], as follows. We reduced data sparsity and noise in the ATAC-seq data by performing dimensionality reduction using the topic modeling framework cisTopic [24]. The dimensions of the RNA-seq data were reduced using PCA. The resulting input matrices for the SNARE-seq data were of size 1047 × 19 and 1047 × 10 for ATAC-seq and RNA-seq, respectively. Similar to the simulation datasets, we Z-score normalized all real-world datasets.
Baselines
We compare SCOT with the two state-of-the-art unsupervised single-cell alignment methods MMD-MA [7] and UnionCom [8]. Note that none of these methods use any correspondence information for aligning the datasets.
Hyperparameter tuning
To select hyperparameters, we ran each method over a grid of hyperparameters and selected the setting that yielded the lowest average FOSCTTM. For SCOT, the grid covers the regularization weight ϵ ∈ {0.00001, 0.0001, 0.0002, 0.0003, ..., 0.1} and number of neighbors k ∈ {5, 10, 20, 30, 40, ..., n}, where n is the number of samples in the dataset. MMD-MA has four parameters to tune: the width σ ∈ {0.01, 0.1, 1.0, 10} of the Gaussian for the initial kernel calculation, the weights λ1, λ2 ∈ {10^-3, 10^-4, 10^-5, 10^-6, 10^-7} for the terms in the optimization problem, and the dimensionality p ∈ {3, 4, 5} of the embedding space. UnionCom also requires the user to specify four hyperparameters: the number k ∈ {5, 10, 25, ..., n} (with increments of 25 after k = 25) of neighbors in the graph, the dimensionality p ∈ {2, 5, 10} of the embedding space, the trade-off parameter β ∈ {0.001, 0.005, 0.01, 0.5, 0.1, 0.5, 1, 5, 10} for the embedding, and a regularization coefficient ρ ∈ {0.001, 0.005, 0.01, 0.5, 0.1, 0.5, 1, 5, 10}. While not related to the algorithmic formulation of UnionCom, we also tuned the learning rate to achieve smoother convergence. We present alignment and runtime results for the best performing hyperparameters of SCOT, MMD-MA, and UnionCom.
Evaluation metrics
All datasets have one-to-one sample-level correspondence information. We use this information solely to quantify the alignment performance of SCOT and the baselines. We use the evaluation metric previously introduced by Liu et al. [7] called “fraction of samples closer than the true match” (FOSCTTM). For each domain, we compute the Euclidean distances between a fixed sample point and all the data points in the other domain. Next, we compute the fraction of those data points that are closer to the fixed sample than its true match is. Finally, we average these values over all the samples to obtain an average FOSCTTM score. For perfect alignment, all samples would be closest to their true match, yielding a value of zero. Therefore, a lower average FOSCTTM corresponds to better alignment performance.
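The metric can be sketched as follows. The function name is hypothetical, and the sketch assumes that both datasets already live in the same space, that row i of one matrix is the true match of row i of the other, and that the fraction is taken over all n samples of the other domain (some implementations divide by n - 1 instead).

```python
import numpy as np


def average_foscttm(X_aligned, Y):
    """Average fraction of samples closer than the true match, over both domains."""
    # d[i, j] = Euclidean distance between aligned x_i and y_j.
    d = np.linalg.norm(X_aligned[:, None, :] - Y[None, :, :], axis=2)
    true_match = np.diag(d)  # distance between each sample and its true match

    # For each x_i: fraction of y_j strictly closer to x_i than y_i is.
    frac_x = (d < true_match[:, None]).mean(axis=1)
    # For each y_j: fraction of x_i strictly closer to y_j than x_j is.
    frac_y = (d < true_match[None, :]).mean(axis=0)

    return np.concatenate([frac_x, frac_y]).mean()


# Toy example (hypothetical data): three points on a line.
Y = np.array([[0.0], [1.0], [2.0]])
perfect = average_foscttm(Y.copy(), Y)   # every sample sits on its true match
reversed_score = average_foscttm(Y[::-1].copy(), Y)  # matches are flipped end to end
```

A perfect alignment scores zero, while the flipped alignment is penalized because the two end points are each farther from their true match than from the other samples.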
For the scGEM dataset, we also adopt a metric used by Cao et al. [8] called “label transfer accuracy.” This metric assesses the alignment performance of the cell label assignment. Specifically, it measures the ability to correctly transfer sample labels from one domain to another based on their neighborhood in the aligned domain. As described in [8], we train a k-nearest neighbor classifier on one of the domains and predict the sample labels in the other domain. The label transfer accuracy is the percentage of correctly predicted labels, so it ranges from 0 to 100%, and higher values indicate better performance.
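Label transfer accuracy can be sketched with a small k-nearest neighbor classifier written in NumPy. This is an illustration rather than the evaluation code of [8]: the function name and toy data are hypothetical, Euclidean distance and an unweighted majority vote are assumed, and ties are broken in favor of the smallest label.

```python
import numpy as np


def label_transfer_accuracy(train_X, train_labels, test_X, test_labels, k=5):
    """Fraction of test labels recovered by a k-NN vote over the training domain."""
    d = np.linalg.norm(test_X[:, None, :] - train_X[None, :, :], axis=2)
    nbrs = np.argsort(d, axis=1)[:, :k]          # k nearest training samples
    correct = 0
    for i, idx in enumerate(nbrs):
        values, counts = np.unique(train_labels[idx], return_counts=True)
        predicted = values[np.argmax(counts)]    # majority vote
        correct += int(predicted == test_labels[i])
    return correct / len(test_labels)


# Toy example (hypothetical data): two well-separated cell types in an aligned space.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0]])
labels = np.repeat([0, 1], 10)
domain_a = centers[labels] + rng.normal(scale=0.1, size=(20, 2))
domain_b = centers[labels] + rng.normal(scale=0.1, size=(20, 2))
accuracy = label_transfer_accuracy(domain_a, labels, domain_b, labels, k=3)
```

The function returns a fraction between 0 and 1; multiplying by 100 gives the percentage form described above.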
4 Results
SCOT successfully aligns the simulated datasets
We first compare SCOT’s performance with the baseline methods for the three simulation datasets. In Figure 2, we sort and plot the FOSCTTM score for each sample. We observe that SCOT achieves the lowest average FOSCTTM metric (averaged over all samples in the datasets) and demonstrates its ability to recover the correct correspondences in simulations.
SCOT gives state-of-the-art performance for single-cell multi-omics alignment
Next, we apply our method to real-world single-cell sequencing assays and observe that SCOT gives comparable performance to the baseline methods. For scGEM data, the best FOSCTTM values are 0.201, 0.217, and 0.2066 for MMD-MA, UnionCom, and SCOT, respectively (Figure 3). Since the barycentric projection averages the data together, we observe that the expression data clusters near the mean of the manifold it is projected on (methylation data) in Figure 3(B).
As in [8], we use the label transfer accuracy metric to quantify how well the cells with the same label cluster together after alignment. For k = 5 (the default value used by Cao et al.), the label transfer accuracy values for MMD-MA, UnionCom, and SCOT are 0.5876, 0.5311, and 0.5650, respectively, when the DNA methylation dataset is used as the training set. For the training set composed of gene expression, the values are 0.6384, 0.4689, and 0.6554, respectively. We report results for other values of k for 1 ≤ k ≤ 8 in Supplementary Figure S2.
Next, we compare all three methods for the SNARE-seq dataset (Figure 4). This dataset consists of a larger number of cells (1047) compared to scGEM (177). MMD-MA yields the best FOSCTTM, followed by SCOT, with average FOSCTTM values of 0.1499 and 0.1985, respectively. UnionCom achieves lower performance with an average FOSCTTM value of 0.265.
A primary difference between MMD-MA and UnionCom versus SCOT is that, rather than projecting both datasets to a lower-dimensional space, our method projects one dataset onto the other. To test whether the direction of the embedding matters, we ran SCOT in both directions for all datasets. In each case, we do not observe a significant difference in performance between the two directions, with average FOSCTTM values of 0.0712 (Sim 1), 0.0063 (Sim 2), 0.0084 (Sim 3), 0.2220 (scGEM), and 0.2281 (SNARE-seq).
SCOT is faster than other alignment algorithms
We directly compared the running times of SCOT with the baseline methods for the best performing hyperparameters. We ran the CPU versions of the algorithms on an Intel i5-8259U CPU (base frequency 2.30 GHz) with 16 GB memory. UnionCom also has a GPU version that we ran on a single NVIDIA GTX 1080 Ti with 11 GB of VRAM. We observe that SCOT converges ~15, ~50, and ~10 times faster than MMD-MA, UnionCom, and UnionCom-GPU, respectively, for the largest dataset, SNARE-seq (Table 1).
5 Discussion
Integrating different single-cell modalities is an important task with challenges that require development of effective alignment algorithms. We have demonstrated that SCOT, which uses Gromov Wasserstein-based optimal transport to perform unsupervised integration of single-cell multi-omics data, performs well when compared to two state-of-the-art methods but in less time and with fewer hyperparameters.
To apply an evaluation metric and quantify the quality of alignment, we need to project the data into the same space. Here, we choose to use a barycentric projection to project one domain onto another, but there are various other ways to use the coupling matrix to infer alignment. For example, the coupling matrix could also be used with other dimension reduction methods such as t-SNE (as in UnionCom) to align the manifolds while embedding them both into new spaces. Additionally, depending on the application, a projection may not be required. For some downstream analyses, it may be sufficient to have probabilities relating the samples to one another. Our future work will focus on developing effective ways to utilize the coupling matrix and extend our framework to handle more than two alignments at a time.
We demonstrated the relative speed of convergence of SCOT. This speed benefit is further enhanced by the fact that, unlike MMD-MA and UnionCom which require tuning of four parameters, SCOT requires tuning of only two parameters. We also show (Supplementary Section 1.4) that SCOT is robust to the choice of k. In this way, SCOT dramatically reduces the hyperparameter search space, making application of the algorithm faster and easier.
1 Supplementary Information
1.1 Simulation Data Sets
1.2 Barycentric Projections in Both Directions
1.3 Label Transfer Accuracy for scGEM Data Set
Acknowledgments
We are grateful to Yang Lu, Jean-Philippe Vert, and Marco Cuturi for helpful discussion of Gromov-Wasserstein optimal transport.