Fast, sensitive, and accurate integration of single cell data with Harmony

The rapidly emerging diversity of single cell RNAseq datasets allows us to characterize the transcriptional behavior of cell types across a wide variety of biological and clinical conditions. With this comprehensive breadth comes a major analytical challenge. The same cell type across tissues, from different donors, or in different disease states, may appear to express different genes. A joint analysis of multiple datasets requires the integration of cells across diverse conditions. This is particularly challenging when datasets are assayed with different technologies, in which real biological differences are interspersed with technical differences. We present Harmony, an algorithm that projects cells into a shared embedding in which cells group by cell type rather than dataset-specific conditions. Unlike available single-cell integration methods, Harmony can simultaneously account for multiple experimental and biological factors. We develop objective metrics to evaluate the quality of data integration. In four separate analyses, we demonstrate the superior performance of Harmony to four single-cell-specific integration algorithms. Moreover, we show that Harmony requires dramatically fewer computational resources. It is the only available algorithm that makes the integration of ~10^6 cells feasible on a personal computer. We demonstrate that Harmony identifies both broad populations and fine-grained subpopulations of PBMCs from datasets with large experimental differences. In a meta-analysis of 14,746 cells from 5 studies of human pancreatic islet cells, Harmony accounts for variation among technologies and donors to successfully align several rare subpopulations. In the resulting integrated embedding, we identify a previously unidentified population of potentially dysfunctional alpha islet cells, enriched for genes active in the Endoplasmic Reticulum (ER) stress response.
The abundance of these alpha cells correlates across donors with the proportion of dysfunctional beta cells also enriched in ER stress response genes. Harmony is a fast and flexible general purpose integration algorithm that enables the identification of shared fine-grained subpopulations across a variety of experimental and biological conditions.


Harmony Scales to Enable Analysis of Large Data
As an integral part of the scRNAseq analysis pipeline, the integration algorithm must run in a reasonable amount of time and within the memory constraints of standard computers. To this end, we evaluated the computational performance of Harmony, measuring both the total runtime and maximum memory usage. To demonstrate the scalability of Harmony versus other methods, we downsampled from HCA data11 (528,688 cells from 16 donors and 2 tissues) to create 5 benchmark datasets with 500,000, 250,000, 125,000, 60,000, and 30,000 cells. We reported the runtime (Table S2) and memory (Table S3) for all benchmarks. Harmony runtime scaled well for all datasets (Figure 3A), and downstream analyses were able to identify shared populations (Figure 3F) using canonical markers (Figure S4). These results demonstrate that Harmony is computationally efficient and capable of analyzing even large datasets (10^5-10^6 cells) on personal computers.

We considered a more complex experimental design, in which integration must be performed simultaneously over more than one variable. We gathered human pancreatic islet cells from five independent studies12-16, each of which was generated with a different technological platform. Integration across platforms is already challenging. However, within two12,13 of the datasets, the authors also reported significant donor-specific effects. In this scenario, a successful integration of these studies must account for the effects of both technology and donor, which may each affect different cell types in different ways.

Harmony is the only single cell integration algorithm that is able to explicitly integrate over more than one variable; hence we omit a comparison against other methods.
Harmony successfully integrated cells across both technology (Figure 5C, median iLISI 95% [..., 3.27]) and donor (Figure 5D,E, median iLISI 4.71, 95% [1.81, 6.36]). Intriguingly, we also observed an alpha cell subset that, to our knowledge, has not been previously described. This cluster was also enriched for genes involved in ER stress, and key genes necessary for endocrine function were downregulated in both the alpha and beta ER stress clusters (Figure 5K,L).

Discussion
Harmony addresses several key challenges in single cell integration: scaling to large datasets, identification of both broad populations and fine-grained subpopulations, and flexibility to accommodate complex experimental designs. We evaluated the degree of mixing among datasets using a quantitative metric, the iLISI. Apart from its use in benchmarking, iLISI was particularly important in analyses with more than 3 datasets. Here, we observed that the commonly utilized approach of assessing integration visually was subjective and insensitive, particularly when the number of samples, batches, or cell types was large. iLISI provides a quantitative and interpretable metric to help guide analysis.

In the computational efficiency benchmarks, we found that 3 of the 5 algorithms were not able to scale beyond 125,000 cells because they exceeded the memory resources of our 128GB servers. We were struck by the fact that many researchers routinely analyze data on personal computers, which often do not exceed 8 or 16GB of memory. Harmony, which required only 7.2GB to integrate 500,000 cells, is the only algorithm that would enable the integration of large datasets on personal computers.

With the pancreatic islet meta-analysis, we demonstrated that Harmony is able to account simultaneously for donor and technology effects. In the future, Harmony should also be able to account for non-discrete sources of variation, such as cell quality. We noticed that it is not an uncommon practice to apply a batch-sensitive gene scaling step before using a single-cell integration algorithm. In our benchmarks, this scaling provided no clear benefit (Figure S8C) and may indeed even increase error (Figure S8D). For this reason, we do not use this scaling strategy in this manuscript.

As part of a universal pipeline, Harmony finds highly integrated embeddings without the need for within-dataset scaling.

Harmony allows the user to set a hyperparameter for each covariate that guides how deeply to integrate over each source of variation. When the penalty is 0, Harmony performs minimal integration.

Harmony is designed to accept a matrix of cells and covariate labels for each cell as input, and it outputs a matrix of adjusted coordinates with the same dimensions as the input matrix. As such, Harmony should be used as an upstream step in a full analysis pipeline. Downstream analyses, such as clustering, trajectory analysis, and visualization, can use the integrated Harmony-adjusted coordinates instead of the commonly used PCA coordinates. As a corollary, Harmony does not alter the expression values of individual genes to account for dataset-specific differences. We recommend using a batch-aware approach, such as a linear model with covariates, for differential expression analysis.

In our meta-analysis of pancreatic islet cells, we identified a previously undescribed rare subpopulation of alpha ER stress cells (Figure 5F,J). Similar to beta ER stress cells, they appear to have reduced endocrine function (Figure 5K).

Because Harmony integrated over both donors and technology (Figure 5C,D,E), we were able to identify the significant association between the proportions of the alpha and beta ER stress populations across donors (Figure 5L).

Note that the correction procedure uses Z, not Ẑ, to regress out confounder effects. In this way, we restrict correction to a linear model of the original embedding. An alternative approach would use the output Ẑ of the last iteration as input to the correction procedure. Thus, the final Ẑ would be the result of a series of linear corrections of the original embedding.
This limit on the transformation reflects the notion laid out in the introduction: if we had perfect knowledge of the cell types before correction, we would linearly regress out batch within each cell type.

Lastly, we note that Z can be any arbitrary embedding of the cells. In this paper, we use principal components for scRNAseq datasets. However, Harmony can be efficiently run on any low dimensional embedding of the data, including diffusion maps, autoencoders, or independent components.

We developed a clustering algorithm to maximize the diversity among batches within clusters. We present this method as follows. First, we review a previously published objective function for soft k-means clustering. We then add a diversity-maximizing regularization term to this objective function, and derive this regularization term as the penalty on statistical dependence between two random variables: batch membership and cluster assignment. We then derive and present pseudocode for an algorithm to optimize the objective function. Finally, we explain key details of the implementation.

The basic objective function for classical k-means clustering, in which each cell belongs to exactly one cluster, is defined by the distance from cells to their assigned centroids:

argmin_{R,Y} Σ_i Σ_k R_ki ||Z_i - Y_k||² (1)
Above, Z is some feature space of the data, shared by centroids Y. R_ki can take values 0 or 1, denoting membership of cell i in cluster k. In order to transform this into a soft clustering objective, we follow the direction of 24 and add an entropy regularization term over R, weighted by a hyperparameter σ:

argmin_{R,Y} Σ_i Σ_k R_ki ||Z_i - Y_k||² + σ Σ_i Σ_k R_ki log R_ki (2)

Now, R_ki can take values between 0 and 1, so long as for a given cell i, the sum over cluster memberships Σ_k R_ki equals 1. That is, R_i· must be a proper probability distribution over clusters. As σ approaches 0, this penalty approaches hard clustering. As σ approaches infinity, the entropy of R outweighs the data-centroid distances; in this case, each data point is assigned equally to all clusters.
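To make the soft assignment concrete, the entropy-regularized objective yields responsibilities of softmax form over scaled distances. The following numpy sketch is our own illustration, not the Harmony implementation; the variable names and the σ default are our assumptions:

```python
import numpy as np

def soft_assignments(Z, Y, sigma=0.1):
    """Soft k-means E-step: R[k, i] is proportional to exp(-||Z_i - Y_k||^2 / sigma).

    Z: (d, N) cell embedding; Y: (d, K) centroids.
    Returns R: (K, N), each column a probability distribution over clusters.
    """
    # Squared Euclidean distance between every centroid and every cell.
    d2 = ((Y[:, :, None] - Z[:, None, :]) ** 2).sum(axis=0)  # (K, N)
    # Subtract the column-wise max before exponentiating, for numerical stability.
    logits = -d2 / sigma
    logits -= logits.max(axis=0, keepdims=True)
    R = np.exp(logits)
    return R / R.sum(axis=0, keepdims=True)
```

As σ shrinks, the softmax sharpens toward the hard k-means assignment; as σ grows, the columns of R flatten toward the uniform distribution.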

Objective Function for Maximum Diversity Clustering

The full objective function for Harmony's clustering builds on the previous section. In addition to soft-assignment regularization, the function below penalizes clusters with low batch diversity, for all defined batch variables. This penalty, derived in the following section, depends on the cluster assignments R and the batch identities φ.

For each batch variable f, we add a new parameter θ_f. θ_f decides the degree of penalty for dependence between batch membership and cluster assignment. When θ_f = 0 for all f, the problem reverts back to (2), with no penalty on dependence. As θ_f increases, the objective function favors more independence between batch f and cluster assignment. As θ_f approaches infinity, it yields a degenerate solution in which each cluster has an equivalent distribution across batch f, but the distances between cells and centroids may be large. Finally, σ is added to this term for notational convenience in the gradient calculations.

We found that this clustering works best when we compute the cosine, rather than Euclidean, distance between Z and Y. Haghverdi et al25 showed that the squared Euclidean distance is equivalent to cosine distance when the vectors are L2-normalized. Therefore, assuming that all Z_i and Y_k have unit L2 norm, the squared Euclidean distance above can be re-written as a dot product: ||Z_i - Y_k||² = 2 - 2 Z_i·Y_k.
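The identity ||Z_i - Y_k||² = 2 - 2 Z_i·Y_k for unit vectors can be checked numerically; this small script is our own illustration:

```python
import numpy as np

def l2_normalize(X, axis=0):
    """Scale each vector along `axis` to unit L2 norm."""
    return X / np.linalg.norm(X, axis=axis, keepdims=True)

# Two random vectors, L2-normalized as Harmony assumes for Z_i and Y_k.
rng = np.random.default_rng(0)
z = l2_normalize(rng.normal(size=5))
y = l2_normalize(rng.normal(size=5))

sq_euclid = np.sum((z - y) ** 2)
cosine_dist = 1.0 - z @ y
# For unit vectors, ||z - y||^2 = 2 - 2 z.y = 2 * cosine distance.
assert np.isclose(sq_euclid, 2 * cosine_dist)
```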
Cluster Diversity Score

Here, we discuss and derive the diversity penalty term Ω(·), defined in the previous section. For simplicity, we discuss diversity with respect to a single batch variable, as the multiple batch penalty terms are additive in the objective function.

The goal of Ω(·) is to penalize statistical dependence between batch identity and cluster assignment. In statistics, dependence between two categorical variables can be quantified by comparing the observed (O) and expected (E) co-occurrence counts, for instance with a Kullback-Leibler (KL) divergence. Next, we define the KL divergence in terms of R. Note that both O and E depend on R. However, in the derivation below, we expand one of the O terms. This serves a functional purpose in the optimization procedure, described later. Intuitively,

in the update step of R for a single cell, we compute O and E on all the other cells. In this way, we decide how to assign the single cell to clusters given the current distribution of batches amongst clusters.
For multiple batch variables, we sum over these divergence terms to get the penalty in equation (3).
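Putting the pieces together, the penalty for a single batch variable can be sketched as a KL-style divergence between observed (O) and expected (E) batch-cluster counts. This is our own minimal numpy illustration, not the reference implementation; the `eps` pseudocount is our addition:

```python
import numpy as np

def diversity_penalty(R, phi, eps=1e-12):
    """KL-style divergence between observed and expected batch/cluster counts.

    R:   (K, N) soft cluster assignments (columns sum to 1).
    phi: (B, N) one-hot batch indicator for each cell.
    """
    N = R.shape[1]
    O = R @ phi.T                                 # observed counts, (K, B)
    n_b = phi.sum(axis=1)                         # cells per batch, (B,)
    E = R.sum(axis=1, keepdims=True) * (n_b / N)  # expected under independence
    return float(np.sum(O * np.log((O + eps) / (E + eps))))
```

A perfectly batch-mixed assignment yields a penalty near zero, while a batch-confounded assignment (each cluster drawn from a single batch) is penalized.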

Optimization
Optimization of (4) admits an Expectation-Maximization framework, iterating between cluster assignment (R) and centroid (Y) estimation.

Cluster assignment R. Using the same strategy as 24, we solve for the optimal assignment R_i for each cell i. First, we set up the Lagrangian with dual parameter λ and solve for the partial derivative with respect to each cluster k.
Finally, we substitute exp(-λ/σ - 1) to remove the dependency of R_ki on the dual parameter λ.
The denominator term above ensures that R_i sums to one. In practice (Algorithm 2), we compute the numerator and divide by the sum.
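Combining the distance kernel with the diversity term, the per-cell assignment update can be sketched as follows. This is our own illustration for a single batch variable; the function name and the 1e-12 pseudocount (to avoid division by zero) are our assumptions:

```python
import numpy as np

def update_R_cell(Z_i, Y, O, E, theta, sigma=0.1):
    """Assignment update for one cell (cosine distance, single batch variable).

    Z_i: (d,) unit-norm cell; Y: (d, K) unit-norm centroids.
    O, E: (K,) observed/expected counts of this cell's batch in each cluster.
    """
    dist = 2.0 * (1.0 - Y.T @ Z_i)                   # squared Euclidean on unit vectors
    penalty = ((E + 1e-12) / (O + 1e-12)) ** theta   # favors under-represented batches
    R = np.exp(-dist / sigma) * penalty
    return R / R.sum()                               # normalize to a distribution
```

With θ = 0 the penalty factor is 1 and the update reduces to the plain soft k-means assignment; with θ > 0, clusters already saturated with the cell's batch (O much larger than E) become less attractive.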

Centroid Estimation Y. Our clustering algorithm uses cosine distance instead of Euclidean distance. In the context of k-means clustering, this approach was pioneered by Dhillon et al.26 We adopt their centroid estimation procedure for our algorithm. Instead of just computing the mean position of all cells that belong in cluster k, this approach then L2-normalizes each centroid vector to make it unit length. Note that normalizing the sum over cells is equivalent to normalizing the mean of the cells. In the soft clustering case, this summation is an expected value of the cell positions, under the distribution defined by R. That is, re-normalizing R_·k for cluster k gives the probability of each cell belonging to cluster k. Again, this re-normalization is a scalar factor that is irrelevant once we L2-normalize the centroids. Thus, the unnormalized expectation of the centroid position for cluster k is Y_k = E_{R_·k}[Z] = Σ_i R_ki Z_i. In vector form, for all centroids, this is Y = Z R^T.

The final position of the cluster centroids is given by this summation followed by L2 normalization of each centroid. This procedure is implemented in Algorithm 2 in the section {Compute Cluster Centroids}.
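The centroid step described above amounts to one matrix product followed by column-wise normalization; a minimal sketch with our own variable names:

```python
import numpy as np

def update_centroids(Z, R):
    """Centroid update Y = Z R^T followed by L2 normalization of each centroid.

    Z: (d, N) unit-norm cells; R: (K, N) soft assignments.
    Returns Y: (d, K) with unit-norm columns.
    """
    Y = Z @ R.T  # unnormalized expected centroid positions
    return Y / np.linalg.norm(Y, axis=0, keepdims=True)
```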

Algorithm 2: Maximum Diversity Clustering. The update steps of R and Y derived above form the core of Maximum Diversity Clustering, outlined as Algorithm 2. This section explains the other implementation details of this pseudocode. Again, for simplicity, we discuss the details related to the diversity penalty terms θ, φ, O, and E for a single batch variable.

Block Updates of R. Unlike in regular k-means, the optimization procedure above for R cannot be faithfully parallelized, as the values of O and E change with R. The exact solution therefore depends on an online procedure. For speed, we can coarse-grain this procedure and update R in small blocks (e.g. 5% of the data), while O and E are computed on the held-out data. In practice, this approach succeeds in minimizing the objective for sufficiently small block sizes. In the algorithm, these blocks are included as the Update Blocks in the for loop.
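A block update pass might be sketched as follows. This is our own simplification (single batch variable, one pass over the data), with O and E recomputed on the held-out cells before each block; all names and defaults are our assumptions:

```python
import numpy as np

def block_update_R(Z, Y, phi, R, theta, sigma=0.1, block_frac=0.05, rng=None):
    """One pass of block-wise R updates (single batch variable).

    Z: (d, N) unit-norm cells; Y: (d, K) unit-norm centroids;
    phi: (B, N) one-hot batches; R: (K, N) current soft assignments.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    K, N = R.shape
    order = rng.permutation(N)
    block = max(1, int(block_frac * N))
    for start in range(0, N, block):
        idx = order[start:start + block]
        held_out = np.setdiff1d(np.arange(N), idx)
        # O and E computed on the held-out cells only.
        O = R[:, held_out] @ phi[:, held_out].T                      # (K, B)
        E = R[:, held_out].sum(1, keepdims=True) * (
            phi[:, held_out].sum(1) / held_out.size)                 # (K, B)
        dist = 2.0 * (1.0 - Y.T @ Z[:, idx])                         # (K, block)
        pen = (((E + 1e-12) / (O + 1e-12)) ** theta) @ phi[:, idx]   # (K, block)
        R_new = np.exp(-dist / sigma) * pen
        R[:, idx] = R_new / R_new.sum(0, keepdims=True)
    return R
```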

Centroid Initialization. We initialize cluster centroids using a batch-balanced random medoid approach. We randomly select K cells within each batch, for a total of B × K candidate centroids. We then randomly select one candidate from each batch, average them, and L2-normalize the result to create each of the K centroids. With this strategy, any single centroid is unlikely to be centered within one batch.
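This initialization can be sketched as follows (our own illustration; it assumes each batch contains at least K cells, and the function name is ours):

```python
import numpy as np

def init_centroids(Z, phi, K, rng=None):
    """Batch-balanced random medoid initialization.

    Pick K random cells per batch, then average one pick from each batch
    and L2-normalize to form each of the K centroids.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    B = phi.shape[0]
    picks = np.empty((Z.shape[0], B, K))
    for b in range(B):
        cells_b = np.flatnonzero(phi[b])
        picks[:, b, :] = Z[:, rng.choice(cells_b, size=K, replace=False)]
    Y = picks.mean(axis=1)  # average one pick from each batch, (d, K)
    return Y / np.linalg.norm(Y, axis=0, keepdims=True)
```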

The diversity penalty term (E_bk/O_bk)^θ can tend towards infinity if there are no cells from batch b assigned to cluster k.

This extreme penalty can erroneously force cells into an inappropriate cluster. To protect against this, we smooth this term to ensure that the denominator will not be zero. The strategy is to use E as a prior and add some portion α of it to O.

In this section, we refer to all effects to be integrated out of the original embedding as batch effects. This does not imply that these effects are purely technical; the terminology is only meant for convenience. After clustering, we estimate and correct for an additive batch effect in low dimensional space. The intuition behind our approach is to give each cell within a batch a different batch effect term, depending on its cell type/state. As we do not know the latter a priori, we use cluster membership as a surrogate variable. For simplicity, we choose to limit our treatment of batch effect to the additive mode and avoid estimating scaling effects. Note that although there are potentially multiple batch assignments handled in the clustering above, the regression step currently deals with only one variable. Here, we assume that φ is φ(1).

Simple Additive Batch Model. Before we jump into the full mixture model, we review a simpler model of additive batch effects.

Here, the batch effect is global and is not sensitive to any surrogate variables, such as cluster membership. We assume the data are normally distributed and model each sample (i) with a multivariate Gaussian.
The covariance matrix Σ is a diagonal matrix with the same variance σ² for each dimension. The mean is a sum of a global intercept vector µ and a batch effect vector βφ_i. In this example, β ∈ R^{d×B} defines an offset vector for each batch. We use φ_i to index into the appropriate batch of β (i.e. βφ_i ∈ R^{d×1}). In order to regress out the batch effect, we estimate the parameters µ and β and subtract the batch offsets (βφ_i) from the raw data. Since the batch effect here is purely additive, we do not estimate variance. µ is the global mean of the dataset (µ = Σ_{i=1}^N Z_i / N). The β terms are estimated by maximizing the log likelihood of (9) with respect to each β_b term independently.
Thus, the batch offset of batch b is the mean of cells inside the batch minus the global mean. We subtract this offset from the raw data (Z) to get the corrected data (Ẑ).
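The simple additive model therefore reduces to subtracting each batch's deviation from the global mean; a minimal numpy sketch (our own illustration, with our own names):

```python
import numpy as np

def correct_global(Z, phi):
    """Global additive batch correction: subtract (batch mean - global mean).

    Z: (d, N) embedding; phi: (B, N) one-hot batch labels.
    """
    mu = Z.mean(axis=1, keepdims=True)   # global mean, (d, 1)
    n_b = phi.sum(axis=1)                # cells per batch, (B,)
    beta = (Z @ phi.T) / n_b - mu        # batch offsets, (d, B)
    return Z - beta @ phi                # subtract each cell's batch offset
```

After correction, every batch has the same mean, namely the global mean µ.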
In the following section, we follow the same approach laid out here to incorporate soft cluster membership into the model. Unlike the simpler model, each cluster has its own intercept µ_k and set of batch terms β_bk.
In this new generative model, each sample is explained as a mixture of Gaussians, with weights R_ki. The cluster-specific intercepts (µ_k) are the (non-spherical) cluster centroids. We compute the MLE β terms using standard methods for Gaussian mixture models (GMMs).

It is known that the GMM does not have a convex log likelihood, so we use the expected log likelihood instead. Fixing µ_k and R_ki, we solve for the MLE β_bk terms.
Note that (10) and (13) have similar forms. Both compute the expected difference between the batch cells and some µ.

384
In the simple model, µ is a global term, while in the mixture model, µ_k is specific to a local cluster k. In order to correct the data in this model, we need a batch offset term for each cell. (12) shows us that this is done by taking expectations over the β_k φ_i terms.

Implementation. Algorithm 3 lays down the pseudocode for GMM Correct, the mixture model correction procedure derived above. For a more vectorized implementation, we rewrite (13) in terms of each β_b ∈ R^{d×K} matrix and (14) in terms of the whole Z matrix.
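A vectorized sketch of this cluster-weighted correction, in our own notation (the 1e-12 pseudocount guarding empty cluster-batch combinations is our addition; this is an illustration of the idea, not the published Algorithm 3):

```python
import numpy as np

def correct_mixture(Z, R, phi):
    """Cluster-weighted additive correction.

    Z: (d, N) embedding; R: (K, N) soft cluster assignments;
    phi: (B, N) one-hot batch labels.
    Each cell's offset is the R-weighted sum of its batch's per-cluster offsets.
    """
    mu = (Z @ R.T) / R.sum(axis=1)                 # cluster intercepts, (d, K)
    Zhat = Z.copy()
    for b in range(phi.shape[0]):
        w = R * phi[b]                             # (K, N) weights, batch b only
        n_bk = w.sum(axis=1)                       # soft counts per cluster, (K,)
        beta_b = (Z @ w.T - mu * n_bk) / (n_bk + 1e-12)  # offsets, (d, K)
        Zhat -= beta_b @ w                         # expected offset per cell
    return Zhat
```

With a single cluster (K = 1 and R all ones) this collapses to the simple global model of the previous section.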
Caveat. This section assumes the modeled dimensions are orthogonal and each normally distributed. This is not strictly true for the L2-normalized embedding.

Existing approaches to quantifying dataset mixing build k-nearest-neighbor (KNN) graphs and report the entropy of the batch distributions within neighborhoods. Our metric for local diversity is related to these approaches, in that we start with a KNN graph. However, our approach considers two problems that these do not.

Our diversity score, the Local Inverse Simpson's Index (LISI), addresses both points. To be sensitive to local diversity, we build Gaussian kernel based distributions of neighborhoods. This gives distance-based weights to cells in the neighborhood, so that more distant cells contribute less to the diversity score. The current implementation computes these local distributions using a fixed perplexity (default 30), which has been shown to be a smoother function than fixing the number of neighbors. We address the second issue, interpretation, using the Inverse Simpson's Index, 1/Σ_{b=1}^B p(b)².

All data analyzed in this manuscript are publicly available through online sources. We included links to all data sources in Table S6.
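Returning to the index itself: for a single neighborhood with kernel weights and batch labels, the weighted Inverse Simpson's Index can be sketched as follows. This is our own minimal illustration; the Gaussian kernel construction and perplexity calibration described above are omitted:

```python
import numpy as np

def inverse_simpson(weights, batches, B):
    """Weighted Inverse Simpson's Index of one cell's neighborhood.

    weights: kernel weights of the neighbors (need not be normalized).
    batches: integer batch label of each neighbor.
    Returns a value in [1, B]: 1 = one batch only, B = perfectly mixed.
    """
    p = np.zeros(B)
    np.add.at(p, batches, weights)  # weighted batch proportions
    p /= p.sum()
    return 1.0 / np.sum(p ** 2)
```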

Preprocessing scRNAseq data. We downloaded raw read or UMI matrices for all datasets from their respective sources. The one exception was the 3pV1 dataset from the PBMC analysis. These data were originally quantified with the hg19 reference, while the other two PBMC datasets were quantified with GRCh38. Thus, we downloaded the fastq files from the 10X website (Table S6) and quantified gene expression counts using Cell Ranger10,29 v2.1.0 with GRCh38. From the raw count matrices, we used a standard data normalization procedure, laid out below, for all analyses, unless otherwise specified. Except for the L2 normalization and within-batch variable gene detection, this procedure follows the standard guidelines of the Seurat single cell analysis platform.

We filtered cells with fewer than 500 genes or more than 20% mitochondrial reads. In the pancreas datasets, we filtered cells with the same thresholds used in Butler et al:6 1750 genes for CelSeq, 2500 genes for CelSeq2, no filter for Fluidigm C1, 2500 genes for SmartSeq2, and 500 genes for inDrop. We then library normalized each cell to 10,000 reads by multiplicative scaling, and log transformed the normalized data. We then identified the top 1000 variable genes, ranked by coefficient of variation, within each dataset, and pooled these genes to form the variable gene set of the analysis.

We used the UMAP algorithm30 to visualize cells in a two dimensional space. For all analyses, UMAP was run with the following parameters: k = 30 nearest neighbors, correlation based distance, and min dist = 0.1.
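The library-size normalization and variable gene steps above can be sketched as follows. This is our own simplified numpy illustration (cell filtering and the downstream embedding steps are omitted, and the function names are ours):

```python
import numpy as np

def normalize_counts(counts, scale=1e4):
    """Library-size normalize each cell to `scale` counts, then log1p-transform.

    counts: (genes, cells) raw count matrix.
    """
    lib = counts.sum(axis=0, keepdims=True)  # total counts per cell
    return np.log1p(counts / lib * scale)

def top_variable_genes(logged, n=1000):
    """Rank genes by coefficient of variation; return the top `n` gene indices."""
    mean = logged.mean(axis=1)
    cv = logged.std(axis=1) / (mean + 1e-12)
    return np.argsort(cv)[::-1][:n]
```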

Comparison to other algorithms.

For the pancreas analysis, we set τ = 5. We set donor to be the primary covariate (θ = 2) and technology the secondary covariate (θ = 4).

469
Identification of alpha and beta ER stress subpopulations.