## Abstract

Single-cell RNA-seq technologies enable high throughput gene expression measurement of individual cells, and allow the discovery of heterogeneity within cell populations. Measurement of cell-to-cell gene expression similarity is critical to identification, visualization and analysis of cell populations. However, single-cell data introduce challenges to conventional measures of gene expression similarity because of the high level of noise, outliers and dropouts. Here, we propose a novel similarity-learning framework, SIMLR (single-cell **i**nterpretation via **m**ulti-kernel **l**ea**r**ning), which learns an appropriate distance metric from the data for dimension reduction, clustering and visualization. We show that SIMLR separates subpopulations more accurately in single-cell data sets than do existing dimension reduction methods. Additionally, SIMLR demonstrates high sensitivity and accuracy on high-throughput peripheral blood mononuclear cells (PBMC) data sets generated by the GemCode single-cell technology from 10x Genomics.

## Background

Single-cell RNA sequencing (scRNA-seq) technologies have recently emerged as a powerful means to measure gene expression levels of individual cells and to reveal previously unknown heterogeneity and functional diversity among cell populations [1]. Quantifying the variation across gene expression profiles of individual cells is key to the identification and analysis of complex cell populations that arise in neurology [2], immunology [3], oncology [4] and developmental biology [5]. The heterogeneity identified across individual cells can answer questions irresolvable by traditional ensemble-based methods, where gene expression measurements are averaged over a population of cells pooled together [6], [7]. Recent studies have demonstrated that *de novo* cell type discovery and identification of functionally distinct cell subpopulations are possible via unbiased analysis of all transcriptomic information provided by scRNA-seq data [8]. Therefore, unsupervised clustering of individual cells using scRNA-seq data is critical to developing new biological insights and validating prior knowledge.

Many existing single-cell studies employ computational and statistical methods that have been developed primarily for analysis of data from traditional bulk RNA-seq methods [8]–[10]. These methods do not address the unique characteristics that make single-cell expression data especially challenging to analyze: outlier cell populations, transcript amplification noise, and biological effects such as the cell cycle [11]. In addition, it has been shown that many statistical methods fail to alleviate other underlying challenges, such as dropout events, where zero expression measurements occur due to sampling or stochastic transcriptional activities [12]. Recently, new single-cell platforms such as DropSeq [13], InDrop [14] and GemCode single-cell technology [15] have enabled a dramatic increase in throughput to thousands of cells. These platforms have adopted recent sequencing protocols such as unique molecular identifiers (UMIs) to create digital counts of transcripts in a cell. However, with 3’-end sequencing instead of full-transcript sequencing, low coverage per cell, and varying capture efficiency, these high-throughput platforms produce data sets where up to 95% of measurements are zeros.

The technological differences across single-cell platforms as well as the biological differences across studies can strongly affect the usability of unsupervised clustering methods. Core to the problem is that unsupervised clustering methods usually rely on specific similarity metrics across the objects to be clustered, and standard similarity metrics may not generalize well across platforms and biological experiments, and thus be unsuitable for scRNA-seq studies. To address this problem and answer the key question of “which cells are similar or different” in a way that generalizes across different single-cell data sets, we introduce SIMLR (single-cell interpretation via multi-kernel learning), a novel framework that learns an appropriate cell-to-cell similarity function from the input single-cell data. SIMLR simultaneously clusters cells into groups for subpopulation identification and produces a 2-D or 3-D visualization of the expression data, with the same similarity function applied for clustering and low-dimensional projection. As a result, the identified cell subpopulations and their separation are intuitively visualized.

SIMLR offers three main unique advantages over previous methods: (1) it *learns* a distance metric that best fits the structure of the data by combining multiple kernels. This is important because the diverse statistical characteristics of single-cell data produced today, driven by high levels of noise and dropout, do not easily fit the specific statistical assumptions made by standard dimension reduction algorithms; the multiple-kernel representation provides a better fit to the true underlying statistical distribution of the specific input scRNA-seq data set. (2) SIMLR addresses the challenge of high levels of dropout events, which can significantly weaken cell-to-cell similarities even under an appropriate distance metric, by employing graph diffusion [16], which improves weak similarity measures that are likely to result from noise or dropout events. (3) In contrast to some previous analyses that pre-select gene subsets of known function [10], [17], SIMLR is unsupervised, thus allowing *de novo* discovery from the data. We empirically demonstrate that SIMLR produces more reliable clusters than commonly used linear methods, such as principal component analysis (PCA) [18], and nonlinear methods, such as t-distributed stochastic neighbor embedding (t-SNE) [19], and we use SIMLR to provide 2-D and 3-D visualizations that assist with the interpretation of single-cell data derived from several diverse technologies and biological samples.

## Results and Discussion

### Overview of Algorithm

Here we highlight the main ideas in our methodology underlying SIMLR, and we provide full details in **Materials and Methods**.

Given an *N* × *M* gene expression matrix *X* with *N* cells and *M* genes (*N* < *M*) as input, SIMLR solves for *S*, an *N* × *N* symmetric matrix that captures pairwise similarities of cells. In particular, *S*_{i,j}, the (*i*, *j*)-th entry of *S*, represents the similarity between cell *i* and cell *j*. SIMLR assumes that if *C* separable populations exist among the *N* cells, then *S* should have an approximate block-diagonal structure with *C* blocks, whereby cells have larger similarities to other cells within the same subpopulation.

We introduce an optimization framework that learns *S* by incorporating multiple kernels to learn appropriate cell-to-cell distances from the data (**Figure 1a**). We provide an efficient algorithm to optimize for *S* while simultaneously learning the block-diagonal structure within *S*. The cell-to-cell similarity values in *S* can be used to create an embedding of the data in 2-D or 3-D for visualization, as well as a projection of the data into a latent space of arbitrary dimension to further identify groups of cells that are similar (**Figure 1b**).

### General optimization framework

SIMLR computes cell-to-cell similarities through the following optimization framework:

$$\min_{S,\,L,\,w}\;\sum_{i,j} D(x_i, x_j)\, S_{i,j} \;+\; \beta \lVert S \rVert_F^2 \;+\; \gamma\,\mathrm{tr}\!\left(L^{\top} (I_N - S)\, L\right) \;+\; \rho \sum_{l} w_l \log w_l$$

$$\text{subject to } L^{\top} L = I_C;\quad \textstyle\sum_l w_l = 1,\; w_l \geq 0;\quad \textstyle\sum_j S_{i,j} = 1,\; S_{i,j} \geq 0 \text{ for all } i,$$

where *x*_{i} is the length-*M* gene expression vector of cell *i*, i.e., the *i*-th row of *X*; *D*(*x*_{i}, *x*_{j}) = Σ_{l} *w*_{l} *D*_{l}(*x*_{i}, *x*_{j}) is the distance between cell *i* and cell *j*, expressed as a linear combination of distance metrics *D*_{l} with weights *w*_{l}; *I*_{N} and *I*_{C} are *N* × *N* and *C* × *C* identity matrices, respectively; and *β*, *γ* and *ρ* are nonnegative tuning parameters. ‖*S*‖_{F} denotes the Frobenius norm of *S*. The optimization problem involves solving for three variables: the similarity matrix *S*, the weight vector *w*, and an *N* × *C* rank-enforcing matrix *L*.

The intuition behind the first term in the formula is that the learned similarity *S* between two cells should be small if the distance between them is large. The second term is a regularization term that avoids over-fitting the model to the data. If there are *C* subpopulations, the gene expression profiles of cells of the same subtype should have high similarity, and ideally the effective rank of *S* should be *C*. Thus, the third term, along with the constraint on *L*, enforces the low-rank structure of *S*: the matrix (*I*_{N} − *S*) is essentially the graph Laplacian [20], and the trace-minimization problem enforces approximately *C* connected components in a similarity graph whose nodes represent the cells and whose edge weights correspond to pairwise similarity values in *S* [20]. The fourth term imposes constraints on the kernel weights to avoid selection of a single kernel; we empirically found that this regularization improves the quality of the learned similarity (**Supplementary Table 1**).
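To make the four terms concrete, here is a minimal numerical sketch that evaluates the objective described above (distance-weighted similarity, Frobenius regularizer, low-rank trace term, and kernel-weight regularizer). The function name, the stacked per-kernel distance input, and the parameter names `beta`, `gamma`, `rho` are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def simlr_objective(S, L, w, dists, beta, gamma, rho):
    """Evaluate a four-term SIMLR-style objective (illustrative sketch).

    S     : (N, N) learned similarity matrix
    L     : (N, C) rank-enforcing matrix with L.T @ L = I_C
    w     : (n_kernels,) nonnegative kernel weights summing to 1
    dists : (n_kernels, N, N) stack of per-metric distance matrices D_l
    """
    N = S.shape[0]
    # D_w(x_i, x_j) = sum_l w_l * D_l(x_i, x_j)
    Dw = np.tensordot(w, dists, axes=1)                   # (N, N)
    term1 = np.sum(Dw * S)                                # distance-weighted similarity
    term2 = beta * np.sum(S ** 2)                         # Frobenius-norm regularizer
    term3 = gamma * np.trace(L.T @ (np.eye(N) - S) @ L)   # low-rank (Laplacian) term
    term4 = rho * np.sum(w * np.log(w))                   # kernel-weight regularizer
    return term1 + term2 + term3 + term4
```

A small sanity check: with zero distances, an identity similarity matrix, and a single kernel, only the Frobenius term contributes.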

One critical component of this optimization problem is the choice of the distance measure *D*(*x*_{i}, *x*_{j}) between pairs of cells. It is well known that the distance metric defined for the input space is critical to the performance of clustering and visualization algorithms designed for high-dimensional data [21]. Due to the presence of outliers and unusual zero-inflated distributions in single-cell data, standard metrics like the Euclidean distance may fail to perform well. Thus, instead of using a pre-defined distance metric, we incorporate multiple kernel learning [22], which flexibly combines multiple distance metrics.

We employ an efficient algorithm for optimizing *S*, *L* and *w* in **Algorithm 1** (with full details for each step in **Materials and Methods**). The intuition behind our procedure is simple: holding two of these three variables fixed, the optimization problem over the third variable is convex. Hence, we alternate between optimizing each variable while holding the other two fixed, until convergence. Step 4 in **Algorithm 1** is an auxiliary step in which we incorporate a similarity-enhancement heuristic based on a graph diffusion process (with full details in **Materials and Methods**). The intuition behind diffusion-based similarity enhancement is two-fold: (1) the diffusion process adds “transitive” similarities between two seemingly dissimilar cells that share many common neighboring cells with high similarity, enabling SIMLR to alleviate the impact of dropout events and noise typical in scRNA-seq data; (2) the similarity matrix *S* obtained from the optimization framework may contain arbitrarily weakened entries due to the constraints on *S*, so higher-order structures such as local connectivity can be exploited via the diffusion process to improve the numerical stability of the solution.
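The diffusion-based enhancement can be sketched generically as follows; the exact diffusion operator SIMLR uses is specified in **Materials and Methods**, and the parameters `alpha` and `n_steps` here are illustrative:

```python
import numpy as np

def diffuse_similarity(S, alpha=0.8, n_steps=3):
    """Generic graph-diffusion enhancement of a similarity matrix (a sketch,
    not SIMLR's exact operator). Row-normalize S into a transition matrix P,
    then accumulate down-weighted powers of P so that 'transitive' similarity
    flows between cells that share many strong common neighbors."""
    P = S / S.sum(axis=1, keepdims=True)    # row-stochastic transition matrix
    diffused = np.zeros_like(P)
    Pt = np.eye(S.shape[0])
    for t in range(1, n_steps + 1):
        Pt = Pt @ P                          # t-step transition probabilities
        diffused += (alpha ** t) * Pt        # geometrically down-weight long walks
    # Symmetrize so the result is again usable as a similarity matrix
    return (diffused + diffused.T) / 2
```

On a similarity matrix with one weak within-block entry, diffusion raises within-block similarity relative to cross-block similarity, which is the intended dropout-mitigation effect.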

### Dimension reduction for clustering and visualization

SIMLR relies on the stochastic neighbor embedding (SNE) [23] methodology for dimension reduction, with an important modification: t-SNE computes the similarity of the high-dimensional data points using a Gaussian kernel as a distance measure and projects the data onto a lower dimension that preserves this similarity. Instead of using the gene expression matrix as the input to t-SNE, we use the learned cell-to-cell similarity *S* (detailed in **Supplementary Methods**).

For visualization, we use our modified t-SNE algorithm to project the data into two or three dimensions so that hidden structures in the data can be depicted intuitively. For clustering, we use the same approach to reduce the dimensions to *B*, resulting in an *N* × *B* latent matrix *Z*, to which we can apply any existing clustering algorithm, such as K-means [24], to assign a label to each cell. The number of reduced dimensions *B* is by default equal to the number of desired clusters *C*.
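To make the two-stage pattern concrete (embed the similarity matrix, then cluster the latent matrix), the sketch below substitutes a simple spectral-style embedding for SIMLR's modified t-SNE step so the example stays self-contained; that substitution is ours, not the paper's method:

```python
import numpy as np

def cluster_from_similarity(S, C, n_iter=50):
    """Two-stage clustering from a learned similarity matrix (a sketch).

    Stage 1: embed cells into B = C dimensions using the top eigenvectors
             of the symmetrized similarity matrix (stand-in for SIMLR's
             modified t-SNE projection).
    Stage 2: run plain K-means on the resulting N x B latent matrix Z.
    """
    Ssym = (S + S.T) / 2
    vals, vecs = np.linalg.eigh(Ssym)       # eigenvalues in ascending order
    Z = vecs[:, -C:]                        # N x B latent matrix, B = C
    # Deterministic farthest-point initialization of the C centers
    centers = [Z[0]]
    for _ in range(1, C):
        dmin = np.min([((Z - c) ** 2).sum(-1) for c in centers], axis=0)
        centers.append(Z[dmin.argmax()])
    centers = np.array(centers)
    # Standard K-means iterations on the latent coordinates
    for _ in range(n_iter):
        d = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for k in range(C):
            if np.any(labels == k):
                centers[k] = Z[labels == k].mean(axis=0)
    return labels, Z
```

On a block-diagonal similarity matrix, the latent coordinates collapse each block to a tight point group, so K-means recovers the blocks.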

## Applications

### Learning Cell-to-cell similarities

We start by benchmarking SIMLR against conventional predefined measures in capturing true cell-to-cell similarities on four published single-cell data sets (for the full details of each data set, see **Materials and Methods**):

- Eleven cell populations including neural cells and blood cells (Pollen data set [9]).
- Neuronal cells with sensory subtypes (Usoskin data set [8]).
- Embryonic stem cells under different cell cycle stages (Buettner data set [17]).
- Pluripotent cells under different environmental conditions (Kolodziejczyk data set [10]).

We selected these data sets because they span a variety of cell types and have different numbers of subpopulations, representing a wide range of single-cell data; cell types in each data set were known *a priori* and were further validated in the respective studies, providing a reliable gold standard with which to assess clustering performance (**Table 1**). To evaluate SIMLR’s performance on these data sets, we compared subpopulation labels assigned after dimension reduction with SIMLR to the true subpopulation labels from the respective studies.

SIMLR was given the raw gene expressions and the true number of clusters, but no information about the true labels. Once the similarities were computed, we organized the cells according to the known or validated cell populations (i.e., true labels) from each study. We compared the cell-to-cell similarities learned by SIMLR with a similarity matrix computed from Gaussian kernels applied to Euclidean distances (Euclidean Similarity), and a pairwise correlation matrix (Pearson Correlation) (**Figure 2**). We plotted the resulting symmetric similarity matrices, with colors indicating the known labels, and observed that SIMLR learned a similarity matrix with block structures in remarkable agreement with the previously validated labels, while the other similarity matrices agree less well (**Figure 2**). Both Euclidean distances and Pearson correlations are sensitive to outliers, and Pearson correlations do not capture nonlinear relationships, so spurious similarities between cells from different groups surface with these predefined measures. On the other hand, SIMLR handles outliers with its rank constraint and nonlinear similarities using multiple kernels, making its learned similarity more suitable for single-cell data sets.
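The two predefined baselines in this comparison can be reproduced with a few lines of NumPy; the bandwidth `sigma` of the Gaussian kernel is an illustrative choice:

```python
import numpy as np

def euclidean_similarity(X, sigma=1.0):
    """Gaussian kernel applied to pairwise Euclidean distances (the
    'Euclidean Similarity' baseline). Rows of X are cells."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances
    return np.exp(-sq / (2 * sigma ** 2))

def pearson_similarity(X):
    """Pairwise Pearson correlation across cells (the 'Pearson Correlation'
    baseline). np.corrcoef treats each row of X as one variable (cell)."""
    return np.corrcoef(X)
```

Both produce symmetric matrices with ones on the diagonal, which is what the plotted similarity heatmaps display.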

Furthermore, SIMLR demonstrates one additional advantage: it can identify additional subpopulation structures even when the number of clusters input into the algorithm is conservatively selected. From the similarity structure learned by SIMLR in the Kolodziejczyk data set, we observed that each of the three validated clusters could be further divided into sub-clusters, which is consistent with the unsupervised analysis in the corresponding study [10]. However, those sub-clusters were identified using a small number of pre-selected genes in the original study, whereas SIMLR preserved these substructures in an unbiased fashion while clearly removing the spurious similarities between cells from different (validated) groups. In the remainder of our analysis, we use as the gold standard only the unbiased and validated true labels, as opposed to the additional subpopulations identified by unsupervised analysis.

### Comparison with other clustering methods

SIMLR can also cluster single cells into subgroups to reveal meaningful patterns in scRNA-seq data sets. Clustering methods typically require the number of clusters as an input parameter in order to reveal meaningful grouping in the data. Thus, when applied to clustering, the input parameter *C* to SIMLR can be conveniently assigned to be the number of clusters used as an input for the downstream clustering algorithm.

We set up the default clustering application of SIMLR as the following two-stage procedure: (1) reduce the data to a *C*-dimensional latent space, and (2) apply K-means clustering with the number of clusters set to *C* to compute the cluster assignment of each cell. However, because SIMLR is a similarity framework, it can be flexibly adopted in other clustering frameworks that take similarities directly as inputs; we therefore also provide SIMLR combined with affinity propagation (AP) [25] as an example.

When comparing SIMLR’s two clustering applications (SIMLR default and SIMLR+AP) with other clustering methods, we assume the number of clusters is known *a priori*. The performance metric we use to compare the results to the true cell identities is the normalized mutual information (NMI), a standard measure of clustering concordance [26]. (The formal mathematical definitions of NMI and other clustering performance metrics are provided in **Supplementary Notes**.)
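A standard formulation of NMI, normalizing mutual information by the geometric mean of the two entropies, can be computed as below (this is one common variant; the exact definition the paper uses is in **Supplementary Notes**):

```python
import numpy as np

def nmi(labels_a, labels_b):
    """Normalized mutual information between two clusterings:
    NMI = I(A;B) / sqrt(H(A) * H(B))."""
    a, b = np.asarray(labels_a), np.asarray(labels_b)
    n = len(a)
    ua, ub = np.unique(a), np.unique(b)
    # Joint distribution over cluster-label pairs
    joint = np.array([[np.sum((a == i) & (b == j)) for j in ub]
                      for i in ua]) / n
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0                                   # avoid log(0)
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(pa, pb)[nz]))
    ha = -np.sum(pa * np.log(pa))
    hb = -np.sum(pb * np.log(pb))
    return mi / np.sqrt(ha * hb) if ha > 0 and hb > 0 else 0.0
```

NMI is invariant to label permutation (two clusterings that agree up to renaming score 1) and is near 0 for independent clusterings, which is why it is a suitable concordance measure here.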

We found that K-means clustering with SIMLR consistently outperforms the existing alternatives, namely non-negative matrix factorization (NMF) [27], Gaussian mixture models (GMM) [28] and Dirichlet process mixture models (DPMM) [29], on the four data sets (**Table 2**). These three clustering methods, or variations of them, are model-based algorithms that have been used in recent single-cell studies [30], [31]. Further, we conducted additional experiments with AP, which takes similarities as inputs, to demonstrate that the similarities learned by SIMLR (used for SIMLR+AP) can significantly outperform simpler similarities such as Euclidean similarity (used for Euc.+AP) and Pearson correlation (used for Corr.+AP). This superior clustering performance is expected given the plotted similarity matrices (**Figure 2**).

It is important to note that dimension reduction can be used as a pre-processing step for certain clustering methods, such as K-means, to improve clustering performance; the clustering methods we consider above, however, do not explicitly require dimension reduction. We directly compare dimension reduction using SIMLR with other dimension reduction methods in the next subsection.

### Comparison with other dimension reduction methods

To assess how well SIMLR performs dimension reduction, we compare the latent-space projection produced by SIMLR to the latent-space projections produced by other methods. We use K-means to cluster cells in these different latent spaces and assess which latent-space clustering has the greatest concordance with the true clustering. In addition to the NMI, we also use a supervised approach to compare different dimension reduction results. Following the approach used by [12] to compare dimension reduction methods, we classify each cell based on the true labels of its nearest neighbors in the dimension-reduced space and assess the accuracy of this classification using cross-validation; we refer to this metric as the nearest-neighbor error (NNE). This test assesses how accurately new cells can be classified using cells whose labels are already known. We selected these two types of performance metrics for dimension reduction specifically because they measure different aspects of the data in the low-dimensional latent space: NNE directly reflects how closely cells from the same population surround each other, whereas NMI provides a global view of how well cells from different populations are separated. We also tested four other performance metrics (**Supplementary Notes**, shown in **Supplementary Figures 2–5**) to ensure that SIMLR performs well under different standards.
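The idea behind NNE can be sketched as a leave-one-out nearest-neighbor classification error in the latent space; the paper uses cross-validation over nearest neighbors, and the k = 1 choice below is a simplification for illustration:

```python
import numpy as np

def nearest_neighbor_error(Z, labels):
    """Leave-one-out 1-nearest-neighbor error in latent space Z (a sketch
    of the NNE idea; k = 1 for simplicity). Rows of Z are cells."""
    Z = np.asarray(Z, dtype=float)
    labels = np.asarray(labels)
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)  # pairwise sq. distances
    np.fill_diagonal(sq, np.inf)          # exclude each cell itself
    nn = sq.argmin(axis=1)                # index of the nearest other cell
    return np.mean(labels[nn] != labels)  # fraction misclassified
```

A low NNE means cells from the same population are each other's nearest neighbors in the embedding, which is exactly the local-structure property this metric is meant to capture.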

We performed extensive comparisons of SIMLR with 8 other dimension reduction methods on the four data sets to test its utility for dimension reduction. The 8 methods comprised standard linear methods, including PCA [18], factor analysis (FA) [33] and probabilistic PCA (PPCA) [28]; nonlinear methods, including t-SNE [23], Laplacian eigenmaps [25], multidimensional scaling (MDS) [26] and Sammon mapping [27]; and model-based methods specifically designed for single-cell data, such as zero-inflated factor analysis (ZIFA) [12]. The methods we tested included all methods used in the analyses of the original data sets. In addition, the Pollen and Usoskin data sets were the only ones with validated labels used in [12] to assess ZIFA, which was specifically designed for single-cell data.

The NMI and NNE values for the 9 methods are summarized in **Table 3** and **Table 4** respectively. Our method consistently outperforms the existing alternatives on the four data sets, and most of the differences in NMI and NNE between SIMLR and the second best method are remarkably large.

To test the robustness of SIMLR, we conducted three additional experiments for each data set:

1. We used varying numbers (3–20) of latent dimensions *B* to evaluate the performance of SIMLR and the other methods. This evaluation is critical because the true number of clusters in the data is typically unknown. (As mentioned before, ideally *B* should be equal to the true number of clusters for SIMLR.)

2. We dropped varying fractions (5%–70%) of the gene measurements in the input gene expression matrix to analyze how each method performs when random levels of dropout are present across the data, which is relevant to the high dropout rate in single-cell data.

3. We added independent zero-mean Gaussian noise with varying variances *σ*^{2} (0.1–1) to the gene expression matrix. The number of latent dimensions *B* used in this experiment was set to the true number of clusters *C*. To preserve the dropout characteristics, we ensured that the added noise was set to zero at a frequency equal to the dropout rate in each data set. Formally, we added a random noise vector *y*_{i} to *x*_{i} by the following process: each entry of *y*_{i} is drawn from *N*(0, *σ*^{2}) with probability 1 − *p*_{0} and set to zero with probability *p*_{0}, where *p*_{0} is the dropout rate (proportion of zeros) in the original expression matrix *X*. We set an entry of the new expression matrix to zero if its value after adding noise dropped to zero or below.
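The noise-with-dropout perturbation used in the robustness experiments can be sketched as follows; the `seed` parameter is an illustrative addition for reproducibility, and the clamping rule follows the text:

```python
import numpy as np

def perturb_expression(X, sigma, seed=0):
    """Add zero-mean Gaussian noise to an expression matrix while preserving
    its dropout characteristics: each noise entry is zeroed with probability
    p0 (the proportion of zeros in X), and any perturbed value that falls to
    zero or below is set to zero."""
    rng = np.random.default_rng(seed)
    p0 = np.mean(X == 0)                           # dropout rate of the input
    noise = rng.normal(0.0, sigma, size=X.shape)
    noise[rng.random(X.shape) < p0] = 0.0          # zero noise at the dropout rate
    Y = X + noise
    Y[Y <= 0] = 0.0                                # clamp non-positive values
    return Y
```

With `sigma = 0` the matrix is returned unchanged, which makes the perturbation a clean knob for the robustness sweep.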

The NMI and NNE values (**Figure 3a, b**) on the Buettner data set show remarkably better performance by SIMLR compared to the other methods. In addition, we observe that SIMLR is not sensitive to the number of latent dimensions *B*, even though the most suitable choice of *B* should ideally be the number of clusters, if known. ZIFA and FA achieve high NMI values at certain values of *B*, but are less stable and can perform much worse than SIMLR otherwise (as shown in the first column of **Figure 3**). Moreover, as the fraction of retained gene measurements increases, the NMI increases and the NNE decreases more noticeably for SIMLR. Finally, while the performance of SIMLR decreases with the noise variance, it still outperforms the other methods.

Based on other metrics, including clustering accuracy [32] and purity [33], on all four data sets (shown in **Supplementary Figures 2–5**), we observe that SIMLR also outperforms the other methods, and no other method dominates in any specific regime.

### Visualization of cells in 2-D

After confirming that cell-to-cell similarities learned by SIMLR are meaningful and that the clustering performance of SIMLR is reliable, we applied SIMLR’s SNE-based dimension reduction and visualization in 2-D to verify that the structures in the data are visually intuitive. We compared SIMLR with two of the most commonly used visualization methods in single-cell analysis: PCA and t-SNE. We also included ZIFA, which was shown to outperform many other model-based methods [12]. In the resulting visualizations (**Figure 4**), none of the four methods used the true labels as inputs for dimension reduction; the true label information was added in the form of distinct colors to validate the results.

SIMLR successfully separates the clusters for the Pollen (11 populations) and the Buettner (3 populations) data sets, whereas the other methods produce clusters that are mixed to varying extents. For the 3 populations in the Kolodziejczyk data set, SIMLR and t-SNE perform similarly and separate the clusters more clearly than PCA and ZIFA. For the 4 populations in the Usoskin data set, none of the methods separated the clusters completely, but SIMLR and t-SNE exhibit less overlap than PCA and ZIFA. These results indicate that SIMLR overall uncovers meaningful clusters that are more identifiable than those produced by existing methods. The visualizations of the four data sets using other dimension reduction methods are provided in **Supplementary Figure 1** for reference.

### Application to sparse PBMC data sets

Single-cell RNA-seq data produced by high-throughput microfluidics platforms such as DropSeq [13], InDrop [14] and the GemCode single-cell technology [15] contain up to 95% zero expression counts. We tested the performance of SIMLR on sparse data sets by applying it to PBMC scRNA-seq data generated with the GemCode platform [15]. The PBMC data were from 5 bead-enriched populations: naive B cells (CD19+ and IgD+), CD56+ natural killer (NK) cells, CD8+ cytotoxic T cells, CD4+ T cells and CD14+ monocytes. The purity of the populations was validated by FACS and by clustering analysis of the scRNA-seq transcriptome profiles. To generate a ground-truth set, we created *in silico* mixtures of these 5 populations at fixed proportions: 5%, 10%, 25%, 40% and 20%. SIMLR provided an unbiased classification of these subpopulations that was highly consistent with the true labels (NMI over 0.95 on average). In addition, we used t-SNE and PCA with K-means clustering as baselines, and they also produced good agreement with the true labels (**Figure 5a**). The overall improvement of SIMLR over t-SNE and PCA was noticeable but not highly significant, because many cell types (such as monocytes and naive B cells) are easily distinguishable from the others.

To provide intuition for how SIMLR differs from t-SNE and PCA, we illustrate one of the trials (**Figure 5b-c**, colored by ground-truth cell types) where the 5 different cell types are better separated using SIMLR. PCA produced the most ambiguous visualization across NK cells, CD8+ cytotoxic T cells and CD4+ T cells, whereas t-SNE did not completely separate NK cells from CD8+ cytotoxic T cells. We performed pairwise clustering to elucidate the cases where SIMLR and other methods differ (**Figure 5d** and **Supplementary Figure 7**). While most pairs were easy to separate by all methods, SIMLR’s overall improvement over t-SNE and PCA resulted from its ability to separate CD8+ T cells and NK cells (highlighted in the orange boxes in **Figure 5e**). CD8+ T cells and NK cells are difficult to separate because the two cell types can share several common gene markers and certain T cells share properties of both NK and CD8+ cells [34], [35]. In the case where SIMLR did not outperform PCA (highlighted in the blue boxes in **Figure 5d**), we found that SIMLR mistakenly grouped a very small number of monocytes with NK cells but still correctly separated the vast majority, leading to a negligible difference from PCA.

### Simulations on nonlinear manifold data sets

In addition to real single-cell data sets with ground-truth information, we evaluated SIMLR on simulated data to assess how accurately it captures nonlinear manifolds. We focused on comparisons to t-SNE, an algorithm designed to capture nonlinear manifolds, which SIMLR directly adapts by incorporating the learned similarity matrix into the t-SNE framework. We first created artificial clusters of cells (concentric circles) in a low-dimensional latent space (**Figure 6a**), used a random linear transformation to project the data into a high-dimensional space, and then added noise and zeros following the model in [12] to create the single-cell gene expression matrix. (More details on the simulation methodology are provided in **Supplementary Notes**.) SIMLR separates the clusters more clearly than t-SNE (**Figure 6b**), as measured by NNE (results were similar with other metrics; **Supplementary Figure 7**), at each stage of the data-generation process: prior to adding noise or zeros, after adding noise, and after adding zeros. When more noise (**Figure 6c**) or dropouts (**Figure 6d**) are added to the data, the cluster information in t-SNE becomes obscured, whereas the impact on SIMLR is less significant: its NNE remains lower and the clusters are still well separated.
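The data-generation recipe (concentric circles in a latent space, a random linear lift to high dimension, plus additive noise) can be sketched as below; the noise/dropout model following [12] is omitted for brevity, and all parameter values are illustrative:

```python
import numpy as np

def simulate_circles(n_per_cluster=100, radii=(1.0, 3.0), dim=50,
                     sigma=0.1, seed=0):
    """Generate concentric-circle clusters in 2-D and lift them to a
    high-dimensional 'expression' space with a random linear map plus
    Gaussian noise (a sketch of the simulation idea)."""
    rng = np.random.default_rng(seed)
    pts, labels = [], []
    for k, r in enumerate(radii):
        theta = rng.uniform(0, 2 * np.pi, n_per_cluster)
        pts.append(np.column_stack([r * np.cos(theta), r * np.sin(theta)]))
        labels += [k] * n_per_cluster
    Z = np.vstack(pts)                        # latent 2-D positions
    A = rng.normal(size=(2, dim))             # random linear projection
    X = Z @ A + sigma * rng.normal(size=(len(Z), dim))
    return X, np.array(labels)
```

Because the circles are concentric, the clusters are not linearly separable in the latent space, which is what makes this a useful stress test for nonlinear methods.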

While in this example we generate clusters as concentric circles in the latent space to illustrate how SIMLR can separate even clusters with complex structures, we also evaluated SIMLR’s performance on Gaussian clusters (**Supplementary Figures 9 and 10**), comparing to factor analysis and ZIFA [12] as well as to t-SNE. We found that SIMLR outperformed the other methods if the dropout rate was low, whereas ZIFA outperformed the other methods if the dropout rate was high; the latter result is unsurprising, because ZIFA is explicitly designed to accommodate a high dropout rate and we generated the data using ZIFA’s dropout model.

## Discussion

High-throughput single-cell RNA sequencing technologies have enabled fine-grained analysis of cell-to-cell heterogeneity and molecular functions within tissues and cell populations that until recently could only be studied in bulk. As additional high-throughput approaches become available for single-cell RNA-seq, a wider range of studies pertaining to fundamental cellular functions will continue to emerge. Consequently, single-cell data sets may exhibit even higher levels of diversity (e.g., different tissues, comparisons of healthy versus disease-associated cell populations, different stages of the cell cycle or across development, different experimental technologies, and other sources of variation across data sets). Such diversity in experimental conditions and cell collections makes it difficult and undesirable to define cell-to-cell similarity measures based on strict statistical assumptions. As novel methodologies and algorithms are urgently needed for this new type of data, SIMLR aims to adapt to the heterogeneity across different single-cell data sets by learning an appropriate similarity measure for each data set.

In this paper we extensively evaluated SIMLR using recently published distinct single-cell data sets and without any prior knowledge of groups of significant genes. We demonstrated that SIMLR is able to learn appropriate cell-to-cell distances that uncover similarity structures that would otherwise be concealed by noise or outlier effects. Furthermore, SIMLR successfully clusters cell populations and projects the high-dimensional data in a visually intuitive fashion. We show that SIMLR separates clusters more cleanly than 8 other popular dimension reduction methods, including linear methods (such as PCA), nonlinear methods (such as Sammon), and a recently published approach (ZIFA) specifically designed for single-cell data sets [12].

Because each dimension reduction algorithm makes its own assumptions, it is unlikely that any one is optimal for all data sets. As we have shown, SIMLR performs well on single-cell data sets that contain several clusters, a frequent use case where heterogeneity is defined by distinct cell lineages. But because SIMLR assumes the data have cluster structure, it may not be best suited for data without clear clusters, such as cell populations spanning a continuum. Further, because SNE-based algorithms (such as SIMLR) generally preserve local rather than global structure of the data, they may be best suited for problems where local structure (such as clustering) is of interest rather than continuous global structure (such as progression through pseudotime [36], [37]). Like many nonlinear methods (such as t-SNE), SIMLR scales quadratically with the number of cells during similarity computation. This limits the utility of SIMLR, as well as of most unsupervised methods, when the number of cells is very large, even though it runs quickly on most currently available single-cell data sets. For data sets with large numbers of cells, it would be computationally tractable to first apply SIMLR to a subset of cells and then use the labeled cells to train a classifier to group the remaining cells, as previous analyses have done [13], [38]. It will be interesting to extend SIMLR in this direction to explore the tradeoff between the number of cells used for the initial clustering and the sensitivity in identifying rare cell populations.

## Materials and Methods

### Four Published Data Sets

We used the following four data sets in our analysis:

Eleven cell populations including neural cells and blood cells (Pollen data set [9]). This data set was designed to test the utility of low-coverage single-cell RNA-seq in identifying distinct cell populations, and thus contained a mixture of diverse cell types: skin cells, pluripotent stem cells, blood cells, and neural cells. This data set includes samples sequenced at both high and low depth; we analysed the high-depth samples, which were sequenced to an average of 8.9 million reads per cell.

Neuronal cells with sensory subtypes (Usoskin data set [8]). This data set contains 622 cells from the mouse dorsal root ganglion, with an average of 1.14 million reads per cell. The authors divided the cells into four neuronal types: peptidergic nociceptors, non-peptidergic nociceptors, neurofilament containing, and tyrosine hydroxylase containing.

Embryonic stem cells under different cell cycle stages (Buettner data set [17]). This data set was obtained from a controlled study that quantified the effect of the cell cycle on gene expression level in individual mouse embryonic stem cells (mESCs). An average of 0.5 million reads were obtained for each of the 182 cells and at least 20% of the reads were mapped to known exons on the mm9 mouse genome. The cells were sorted for three stages of the cell cycle using fluorescence-activated cell sorting, and were validated using gold-standard Hoechst staining.

Pluripotent cells under different environment conditions (Kolodziejczyk data set [10]). This data set was obtained from a stem cell study on how different culture conditions influence pluripotent states of mESCs. This study quantified the expression levels of about 10 thousand genes across 704 mESCs from 9 different experiments involving three different culture conditions. An average of 9 million reads were obtained for each cell and over 60% of the reads mapped to exons on the *Mus musculus* genome.

For all the data sets above, we applied a logarithmic transformation *f*(*x*) = log_{10}(*x* + 1) to the single-cell raw expression data.
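For example, the transform is a simple elementwise operation (sketched here in numpy with a made-up count matrix; the published software is in MATLAB and R):

```python
import numpy as np

# Hypothetical raw count matrix: rows = genes, columns = cells.
X = np.array([[0, 9, 99],
              [999, 0, 9]])

# f(x) = log10(x + 1); the +1 pseudocount keeps zero counts (dropouts) at zero.
X_log = np.log10(X + 1)
```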

### Five Purified Immune Cell Types in Human PBMC

scRNA-seq libraries of five bead-enriched PBMC populations were generated by 10x Genomics [15]. After selecting cells by total UMI counts, we computationally sampled a total of 1000 cells at random from the individual purified populations, in proportions of 5%, 10%, 25%, 40% and 20%, respectively, for each *in silico* trial (**Supplementary Figure 6**).
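A minimal numpy sketch of this kind of *in silico* sampling (the population names and sizes below are made up for illustration; only the proportions come from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical post-UMI-filtering population sizes for the five purified populations.
populations = {"pop_a": 8000, "pop_b": 6000, "pop_c": 9000, "pop_d": 7000, "pop_e": 5000}
proportions = [0.05, 0.10, 0.25, 0.40, 0.20]   # fractions of the 1000 sampled cells

# Sample, without replacement, the stated fraction of 1000 cells from each population.
sample = {
    name: rng.choice(n_cells, size=int(round(p * 1000)), replace=False)
    for (name, n_cells), p in zip(populations.items(), proportions)
}
total = sum(len(idx) for idx in sample.values())   # 1000 cells per trial
```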

### Multiple Kernel Learning

Instead of using a predefined distance metric, we incorporate multiple kernel learning in SIMLR to compute the distances between pairs of cells. The general form of the distance between cell *i* and cell *j* is defined as

$$D_w(c_i, c_j)^2 = \sum_{l} w_l \left( \Phi_l(c_i) - \Phi_l(c_j) \right)^T \left( \Phi_l(c_i) - \Phi_l(c_j) \right),$$

where $\Phi_l(c_i)$ is the *l*th kernel-induced *implicit* mapping of the *i*th cell. This mapping is implicit because we are only concerned with the inner products of $\Phi_l(c_i)$ and $\Phi_l(c_j)$ for pairs (*i*, *j*), so the distance expands as

$$D_w(c_i, c_j)^2 = \sum_{l} w_l \left( K_l(c_i, c_i) + K_l(c_j, c_j) - 2 K_l(c_i, c_j) \right) = \sum_{l} w_l \left( 2 - 2 K_l(c_i, c_j) \right),$$

where we only need to compute the kernel functions $K_l(c_i, c_j)$. (The kernel of two identical inputs is set to 1 by convention.) So we can refine the optimization problem to include the distance in the objective and the corresponding weights as variables:

$$\min_{S, L, w} \; \sum_{i,j} D_w(c_i, c_j)\, S_{ij} + \beta \|S\|_F^2 + \gamma\, \mathrm{tr}\!\left( L^T (I_N - S) L \right) + \rho \sum_{l} w_l \log w_l$$

$$\text{subject to} \quad L^T L = I_C, \quad \sum_{l} w_l = 1, \; w_l \geq 0, \quad \sum_{j} S_{ij} = 1, \; S_{ij} \geq 0 \; \text{for all } i.$$

We describe the optimization of *L*, *w* and *S* below; we describe selection of the parameters *β*, *γ* and *ρ* in the **Supplementary Methods**.
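For illustration, a small numpy sketch of the expanded multi-kernel distance. The fixed-bandwidth Gaussian kernels here are placeholders of our own choosing; SIMLR's actual kernel family is specified in the **Supplementary Methods**:

```python
import numpy as np

def gaussian_kernels(X, sigmas):
    """One Gaussian kernel matrix per bandwidth sigma; K_l(c_i, c_i) = 1 by construction."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return [np.exp(-sq / (2.0 * s ** 2)) for s in sigmas]

def multi_kernel_distance(kernels, w):
    """Expanded distance: D_w(c_i, c_j)^2 = sum_l w_l * (2 - 2 * K_l(c_i, c_j))."""
    return sum(wl * (2.0 - 2.0 * K) for wl, K in zip(w, kernels))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))               # toy data: 5 cells x 3 genes
kernels = gaussian_kernels(X, sigmas=[1.0, 2.0])
w = np.array([0.5, 0.5])                  # uniform kernel weights
D = multi_kernel_distance(kernels, w)     # symmetric, zero diagonal, entries in [0, 2]
```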

### Solving the Optimization Problem

*Initialization of S, w and L*: The weight vector of the multiple kernels, *w*, is initialized as a uniform distribution, $w_l = 1/G$, where *G* is the number of kernels. The similarity matrix *S* is initialized as $S_{ij} = \sum_l w_l K_l(c_i, c_j)$, and the latent matrix *L* is initialized as the top *C* eigenvectors of *S*.
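A numpy sketch of this initialization step (toy kernel matrices; `eigh` is appropriate since *S* is symmetric):

```python
import numpy as np

def initialize(kernels, C):
    """w: uniform over G kernels; S: weighted kernel average; L: top-C eigenvectors of S."""
    G = len(kernels)
    w = np.full(G, 1.0 / G)                        # w_l = 1/G
    S = sum(wl * K for wl, K in zip(w, kernels))   # S_ij = sum_l w_l K_l(c_i, c_j)
    _, vecs = np.linalg.eigh((S + S.T) / 2)        # eigenvalues in ascending order
    L = vecs[:, -C:]                               # eigenvectors of the C largest
    return w, S, L
```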

The optimization problem formulated above is non-convex with respect to all of the variables *S, L, w,* but the problem of each variable conditional on other variables being fixed is convex. So we can apply an alternating convex optimization method to solve this tri-convex problem efficiently. The following four steps listed in Algorithm (1) are implemented iteratively.

*Step 1: Fixing L and w to update S*:

When we minimize the objective function with respect to (w.r.t.) the similarity matrix *S*, we can rewrite the optimization problem as follows:

$$\min_{S} \; \sum_{i,j} \left( D_w(c_i, c_j) - \gamma\, l_i^T l_j \right) S_{ij} + \beta \|S\|_F^2 \quad \text{subject to} \quad \sum_{j} S_{ij} = 1, \; S_{ij} \geq 0 \; \text{for all } i,$$

where $l_i$ denotes the *i*th row of *L*. The first summation term in the objective as well as the constraints are all linear in *S*, and the second term is a simple quadratic form, so the problem can be solved in polynomial time [41]. We provide details on how SIMLR is implemented to solve this problem efficiently in **Supplementary Methods**.
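This subproblem separates over the rows of *S*, and each row reduces to a Euclidean projection onto the probability simplex: minimizing $\sum_j m_{ij} S_{ij} + \beta S_{ij}^2$ over the simplex is equivalent to projecting $-m_i/(2\beta)$ onto it. A numpy sketch of this per-row closed form (our own derivation for illustration; the paper's actual solver is described in the **Supplementary Methods**):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex {s : s >= 0, sum(s) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, v.size + 1) > css - 1)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def update_S(D, L, beta, gamma):
    """Row-wise closed form: row i of S is the simplex projection of
    -(D_i - gamma * (L L^T)_i) / (2 * beta)."""
    M = D - gamma * (L @ L.T)
    return np.vstack([project_simplex(-m / (2.0 * beta)) for m in M])
```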

*Step 2: Fixing S and w to update L*:

When we minimize the objective function w.r.t. the latent matrix *L*, we can rewrite the optimization problem as follows:

$$\min_{L} \; \gamma\, \mathrm{tr}\!\left( L^T (I_N - S) L \right) \quad \text{subject to} \quad L^T L = I_C.$$

This is equivalent to maximizing the trace of $L^T (S - I_N) L$, which is maximized when *L* is an orthogonal basis of the eigenspace associated with the *C* largest eigenvalues of $(S - I_N)$ [42]. Thus, *L* can be computed efficiently using any matrix numeric toolbox (with implementation details provided in **Supplementary Methods**).
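This eigenvector computation is a one-liner in most numeric toolboxes; a numpy sketch (since subtracting $I_N$ only shifts the spectrum, the top eigenvectors of *S* itself suffice):

```python
import numpy as np

def update_L(S, C):
    """L = orthogonal basis of the eigenspace of the C largest eigenvalues of S - I_N
    (equivalently, of S: subtracting the identity only shifts all eigenvalues by 1)."""
    _, vecs = np.linalg.eigh((S + S.T) / 2)   # eigh returns ascending eigenvalues
    return vecs[:, -C:]
```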

*Step 3: Fixing S and L to update w*:

When we minimize the objective function w.r.t. the kernel weights *w*, we can rewrite the optimization problem as follows:

$$\min_{w} \; \sum_{i,j} \sum_{l} w_l \left( 2 - 2 K_l(c_i, c_j) \right) S_{ij} + \rho \sum_{l} w_l \log w_l \quad \text{subject to} \quad \sum_{l} w_l = 1, \; w_l \geq 0.$$

This problem, with a convex objective and linear constraints, can be solved by any standard convex optimization method [41]. Details of the update step for *w* and the specific multiple kernels we choose are stated in **Supplementary Methods**.
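With the entropy regularizer, this subproblem even admits a closed form, $w_l \propto \exp(-d_l/\rho)$ where $d_l = \sum_{i,j} (2 - 2 K_l(c_i, c_j))\, S_{ij}$ (our own derivation via the Lagrangian of the simplex constraint, shown for illustration; the paper's update is in the **Supplementary Methods**):

```python
import numpy as np

def update_w(kernels, S, rho):
    """Closed-form entropy-regularized update: w_l ∝ exp(-d_l / rho), with
    d_l = sum_ij (2 - 2 K_l(c_i, c_j)) * S_ij the S-weighted distance under kernel l."""
    d = np.array([np.sum((2.0 - 2.0 * K) * S) for K in kernels])
    w = np.exp(-(d - d.min()) / rho)   # subtract min(d) for numerical stability
    return w / w.sum()
```

Kernels whose induced distances agree with the current similarity *S* receive larger weights, which is the intended behavior of the multiple kernel combination.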

*Step 4: Similarity enhancement*

We apply a diffusion-based step to enhance the similarity matrix *S* and reduce the effects of noise and especially dropouts in single-cell data. Given *S*, we construct a transition matrix *P* such that

$$P_{ij} = \begin{cases} \dfrac{S_{ij}}{\sum_{j' \in A_K(i)} S_{ij'}} & \text{if } j \in A_K(i), \\[4pt] 0 & \text{otherwise,} \end{cases}$$

where $A_K(i)$ represents the set of indices of the cells that are the *K* top neighbors of cell *i* under the learned distance metric. Under this construction, the transition matrix is sparse, so we preserve most of the similarity structure. The diffusion-based method that we apply to enhance the similarity *S* iterates an update of the form $H^{(t+1)} = P H^{(t)} P^T$ together with a regularization term, where $H^{(0)} = S$ is the input and the final iterate of $H_{ij}$ is used as the new similarity measure $S_{ij}$. We show that the diffusion process converges in **Supplementary Methods**.
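A minimal numpy sketch of this enhancement step. The specific regularization used here, $H \leftarrow \alpha\, P H P^T + (1-\alpha) I$, and the parameter `alpha` are assumptions of ours (a common choice in diffusion-based similarity enhancement); the exact scheme SIMLR uses is specified in the **Supplementary Methods**:

```python
import numpy as np

def transition_matrix(S, K):
    """Sparse row-stochastic P: keep only each cell's K top neighbours under S."""
    P = np.zeros_like(S)
    for i in range(S.shape[0]):
        nbrs = np.argsort(S[i])[-K:]               # A_K(i): top-K neighbours of cell i
        P[i, nbrs] = S[i, nbrs] / S[i, nbrs].sum()
    return P

def enhance(S, K=3, alpha=0.8, n_iter=10):
    """Diffusion sketch: H <- alpha * P H P^T + (1 - alpha) * I, starting from H = S.
    NOTE: this regularized form is an assumption, not the paper's exact scheme."""
    P = transition_matrix(S, K)
    H = S.copy()
    I = np.eye(S.shape[0])
    for _ in range(n_iter):
        H = alpha * (P @ H @ P.T) + (1.0 - alpha) * I
    return H
```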

SIMLR iterates the four steps above until convergence. We use the eigengap, defined as the difference between the (*C* + 1)th and *C*th eigenvalues [43] (**Supplementary Methods**), as the convergence criterion. When the method has converged, the similarity *S* should be stable, and so its eigengap should be stable too. Further, a good low-rank similarity *S* should have a small eigengap. We show the dynamics of the eigengap across iterations of SIMLR on the four real data sets (**Supplementary Figure 11**). SIMLR converges within around 10 iterations.
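A sketch of the eigengap computation. We compute it here on the graph Laplacian of *S*, one common convention in spectral clustering [43]; the exact definition SIMLR uses is in the **Supplementary Methods**:

```python
import numpy as np

def eigengap(S, C):
    """Gap between the (C+1)-th and C-th smallest eigenvalues of the graph
    Laplacian of S. A clean C-cluster similarity makes this gap large."""
    Lap = np.diag(S.sum(axis=1)) - (S + S.T) / 2   # unnormalized graph Laplacian
    vals = np.sort(np.linalg.eigvalsh(Lap))        # ascending eigenvalues
    return vals[C] - vals[C - 1]
```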

## Software Availability

SIMLR is freely available as both a MATLAB program and an R package (https://github.com/BatzoglouLabSU/SIMLR). For K-means clustering, we used the kmeans implementation for MATLAB [39] and the built-in function in R. For stochastic neighbor embedding, we modified the source code of the t-SNE implementation in MATLAB [19] and that in R [40] respectively.

## Additional File

The following additional file is available with this paper. Additional file 1: **Supplementary Information** includes Supplementary Notes, Figures and Methods.

## Abbreviations

SIMLR: Single-cell interpretation via multi-kernel learning; mESCs: mouse embryonic stem cells; NMI: Normalized mutual information; NNE: Nearest neighbor error; MDS: Multidimensional scaling; FA: Factor analysis; PCA: Principal component analysis; PPCA: Probabilistic principal components analysis; ZIFA: Zero-inflated factor analysis; SNE: Stochastic neighbor embedding; t-SNE: t-distributed stochastic neighbor embedding.

## Competing Interests

SB is co-founder of DNAnexus and member of the scientific advisory boards of 23andMe, Genapsys and Eve Biomedical.

## Authors’ Contributions

BW, JZ and SB conceived the study and planned the experiments. BW designed the algorithm and implemented the software in MATLAB. DR developed the software package in R. BW, JZ and EP performed the data analysis and implemented the simulation study. JZ and EP drafted the manuscript. BW and SB contributed to the manuscript. All authors read and approved the final manuscript.

## Acknowledgements

The authors would like to thank Grace X. Zheng, Jessica Terry and Tarjei Mikkelsen from 10x Genomics for providing access to the PBMC data as well as suggestions for the manuscript and the *in silico* experiments. EP acknowledges support from an NDSEG Fellowship and a Hertz Fellowship. JZ acknowledges support from a Stanford Graduate Fellowship.