Monet: An open-source Python package for analyzing and integrating scRNA-Seq data using PCA-based latent spaces

Single-cell RNA-Seq is a powerful technology that enables the transcriptomic profiling of the different cell populations that make up complex tissues. However, the noisy and high-dimensional nature of the generated data poses significant challenges for its analysis and integration. Here, I describe Monet, an open-source Python package designed to provide effective and computationally efficient solutions to some of the most common challenges encountered in scRNA-Seq data analysis, and to serve as a toolkit for scRNA-Seq method development. At its core, Monet implements algorithms to infer the dimensionality and construct a PCA-based latent space from a given dataset. This latent space, represented by a MonetModel object, then forms the basis for data analysis and integration. In addition to validating these core algorithms, I provide demonstrations of some more advanced analysis tasks currently supported, such as batch correction and label transfer, which are useful for analyzing multiple datasets from the same tissue. Monet is available at https://github.com/flo-compbio/monet. Ongoing work is focused on providing electronic notebooks with tutorials for individual analysis tasks, and on developing interoperability with other Python scRNA-Seq software. The author welcomes suggestions for future improvements.

Single-cell RNA-Seq (scRNA-Seq) has become a widely used technology to elucidate the transcriptomes of individual cell populations in complex tissues, with applications in immunology, cancer research, developmental biology, neurobiology, and other fields. The analysis of scRNA-Seq data presents a unique combination of computational and statistical challenges [1][2][3][4]: First, the data is very noisy, mostly due to the fact that only a random subset of mRNA molecules from each cell is detected. Therefore, all scRNA-Seq analysis methods must adopt strategies aimed at separating biological expression differences from technical noise. Second, the data is very high-dimensional, making it essential to employ some form of dimensionality reduction. In high-dimensional space, cells all appear nearly equidistant from one another, an effect sometimes referred to as the "curse of dimensionality". Third, datasets are very large, often containing data for thousands of cells, making it difficult to efficiently store and load data, both on disk and in memory. Fourth, in addition to the biological heterogeneity present within one dataset, researchers are commonly interested in studying heterogeneity across datasets (e.g., differences between treatment conditions or individuals), posing challenges as to how to jointly analyze, or "integrate", multiple datasets. This requires methods that can overcome or correct for batch effects, the nature and magnitude of which are often unknown. These technical challenges underlie and permeate almost any aspect of scRNA-Seq data analysis, independently of whether the ultimate goal is to obtain a particular visualization of the data, to perform clustering, to order cells along a developmental trajectory, or to make comparisons between datasets.
Since there often exist many different approaches for each type of analysis (e.g., many different clustering algorithms), and many different approaches to address each of the aforementioned technical challenges, it is perhaps not surprising that hundreds of scRNA-Seq analysis tools have been developed [5]. However, even for experienced computational biologists, navigating this vast methodological landscape can be difficult, as it often requires significant effort to understand how the approaches chosen by a particular tool affect the data and interact with each other to produce the final analysis result. To allow researchers to perform common scRNA-Seq analysis tasks without having to navigate hundreds of different tools, multiple "comprehensive" software packages for analyzing scRNA-Seq data have been developed. The most popular examples include the R packages Seurat [6] and Monocle [7], as well as the Python package Scanpy [8]. In principle, these packages can implement a "core analysis framework" for addressing the aforementioned technical challenges, while providing a user-friendly interface for performing different scRNA-Seq analysis tasks. To be able to properly interpret analysis results, researchers need to develop an understanding of how the core analysis framework operates, at least at an intuitive level. However, it is much more feasible to familiarize oneself with a single framework than with dozens of independently developed tools with narrower focus. Package authors should therefore publish clear explanations of the core analysis framework. Here, I describe a new Python software package termed Monet for analyzing scRNA-Seq data.
The core analysis framework of this package consists of an algorithm to learn a PCA-based latent space from a given dataset, with the dimensionality being automatically determined using molecular cross-validation [9], as well as an algorithm to project arbitrary scRNA-Seq datasets (usually from the same tissue) into such a latent space. While PCA is commonly used in the analysis of scRNA-Seq data [10], Monet's core analysis framework avoids or replaces many of the steps commonly used by other packages in the preprocessing of the data, including gene selection, log transformation, or any kind of parametric modeling [2,11]. It also explicitly puts latent spaces, encapsulated by MonetModel objects, at the center of the analysis of scRNA-Seq data. Monet relies as much as possible on standard machine learning algorithms to perform specific tasks (e.g., visualization with t-SNE, clustering with DBSCAN, k-nearest-neighbor classification for label transfer), while also implementing successful ideas previously described in the single-cell literature (e.g., batch correction by matching mutual nearest neighbors [12]). The package also contains an implementation of ENHANCE, a previously developed denoising method [3] that uses the Monet latent space model (with a simpler heuristic for inferring dimensionality) in its k-nearest neighbor aggregation step.

Python tools for data manipulation, machine learning and visualization

To develop a software for analyzing scRNA-Seq data in Python, I relied on successful open-source packages from the Python ecosystem (Figure 1a). Expression matrices in Monet are represented using
the ExpMatrix class, which is a subclass of the pandas DataFrame class. To store and load raw scRNA-Seq data consisting of UMI counts for each gene and each cell, I found that numpy's compressed .npz binary format offers much better performance in terms of disk usage and loading times than plain-text formats. The save_npz() and load_npz() functions of ExpMatrix objects use this format to efficiently save and load data to/from the hard drive, respectively. For statistical and machine learning tasks, Monet relies heavily on scikit-learn and scipy, while plotly is used to generate visualizations that can be embedded into Jupyter notebooks. These packages offer an incredibly broad set of features and are actively maintained and developed. They can also be easily installed using the conda package manager, although Monet currently only supports installation with the pip package manager. Work to make Monet installable with conda is ongoing.

[Figure 1, panels b and c. b: The core analysis framework relies on the application of PCA to the UMI count matrix, after applying median scaling and a simple square root-based data transformation. c: Overview of the core analysis framework, potential analysis tasks, and code examples. A Monet model is obtained by inferring the dimensionality using molecular cross-validation, applying k-nearest neighbor aggregation, and then performing PCA on the aggregated (and re-scaled) data. This model then serves as the basis for various downstream analysis tasks.]
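To illustrate why the compressed .npz format is attractive for UMI count matrices, the following standalone sketch saves and reloads a mock count matrix using numpy directly. This is the format that, per the text, ExpMatrix's save_npz() and load_npz() build on; Monet's exact on-disk layout may differ.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)
# Mock UMI count matrix (genes x cells); real matrices are much larger and sparser
counts = rng.poisson(1.0, size=(1000, 500)).astype(np.uint32)

# Save in numpy's compressed .npz binary format
path = os.path.join(tempfile.mkdtemp(), 'counts.npz')
np.savez_compressed(path, matrix=counts)

# Load it back and verify the roundtrip is lossless
loaded = np.load(path)['matrix']
assert (loaded == counts).all()
```

Because UMI count matrices contain many zeros and small integers, the compressed binary representation is typically far smaller and faster to load than a plain-text equivalent.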

Monet's core analysis framework relies on simple data transformations and PCA

Gene expression measurements obtained from scRNA-Seq, represented by UMI counts, are associated with significant levels of technical noise. The amount of noise strongly depends on the expression level of the gene, but in many cases exceeds 100% (coefficient of variation), in which case the standard deviation representing the technical variation is larger than the true expression level. In 2014, Grün et al. [1] observed that the technical noise displayed by UMI counts can be understood as a combination of sampling noise and efficiency noise, where sampling noise refers to the stochastic variation introduced because only a small random subset of transcripts for each cell is detected, whereas efficiency noise refers to stochastic differences in the overall number of transcripts detected for each cell. The authors further observed that sampling noise was the dominant source of technical variation for all except the most highly expressed genes. In a 2017 paper [13], I built on these observations and proposed to preprocess scRNA-Seq datasets by using a two-step procedure (Figure 1b). In the first step, the expression profiles of all cells are scaled to the median transcript count per cell, in order to counteract efficiency noise, which was already discussed by Grün et al. In the second step, a simple square root-based transform, y = √x + √(x + 1), is applied to the scaled expression values. The main motivation for using this Freeman-Tukey transform [14] is to let measurements contribute to the downstream PCA step in approximate proportion to their signal-to-noise ratio, meaning that the relatively accurate measurements obtained for highly expressed genes contribute more (but not too much) to the analysis than those of lowly expressed genes, which contain very little information.
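The two preprocessing steps, median scaling and the Freeman-Tukey transform, can be sketched as follows. This is a minimal standalone version for illustration, not Monet's actual implementation.

```python
import numpy as np

def median_scale(counts):
    """Scale each cell's profile to the median transcript count per cell.

    `counts` is a genes x cells UMI count matrix."""
    per_cell_totals = counts.sum(axis=0)
    median_total = np.median(per_cell_totals)
    return counts * (median_total / per_cell_totals)

def freeman_tukey(x):
    """Freeman-Tukey transform: y = sqrt(x) + sqrt(x + 1)."""
    return np.sqrt(x) + np.sqrt(x + 1)

# Example: 3 genes x 4 cells
counts = np.array([[10, 0, 5, 2],
                   [ 0, 3, 1, 0],
                   [ 4, 1, 2, 6]], dtype=float)
scaled = median_scale(counts)
transformed = freeman_tukey(scaled)

# After scaling, every cell has the same total transcript count,
# equal to the median of the original per-cell totals
assert np.allclose(scaled.sum(axis=0), np.median(counts.sum(axis=0)))
```

Note that the transform maps zero counts to 1 (√0 + √1), a constant offset that does not affect the subsequent PCA.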
This simple transform therefore obviates the need for a gene selection step, which is used by many scRNA-Seq analysis tools [2]. The reader may refer to the Methods section for a discussion of the effects of different data transformations. After scaling and applying the FT transform, Monet performs principal component analysis (PCA) on the data, using a fast randomized implementation provided by scikit-learn based on algorithms described by Halko et al. [15]. Monet complements this simple approach to performing PCA on scRNA-Seq data with algorithms for inferring the dimensionality of a dataset and for reducing PCA overfitting, which are described below. The resulting Monet model represents a tissue-specific latent space that can form the basis for many different analysis tasks (Figure 1c).

Monet infers the dimensionality of a dataset using molecular cross-validation

In molecular cross-validation (MCV), the observed UMI counts of a dataset are partitioned into training and test datasets. Batson et al. [9] used simulation studies and statistical theory to show that when this is done in an appropriate fashion, the parameter value that minimizes a loss function on the test dataset (MCV loss) is also the value that minimizes the ground truth loss. Monet implements 5-fold MCV with a Poisson loss function to infer the dimensionality in the context of the previously described PCA framework (see Methods).

I first tested this approach on three different human PBMC datasets obtained using 10x Genomics' Chromium technology (Figure 2a). For the two PBMC datasets obtained using the v2 chemistry, Monet inferred a significantly lower dimensionality (19 and 22) than for the dataset obtained using the v3 chemistry (30). This is consistent with the fact that the v3 chemistry achieves a much higher transcript detection rate, and therefore is able to produce a higher-resolution view of the different subpopulations of cells. Both v2 PBMC datasets were obtained using cells from the same donor, and Monet inferred a lower dimensionality (19) for the dataset containing ~4,000 cells than for the dataset containing ~8,000 cells (22), which again seemed consistent. Finally, I tested the approach on a mouse embryonic heart dataset, for which Monet inferred a dimensionality of 52. This was significantly higher than for any of the PBMC datasets, which was consistent with the fact that the heart dataset appeared to contain a much larger number of distinct cell types.

To quantitatively validate the MCV-based inference of dimensionality, I modified a previously described PCA-based approach to simulate scRNA-Seq data using real datasets as templates [3], which allowed me to generate artificial scRNA-Seq datasets with a truncated and thus clearly defined dimensionality. I simulated human PBMC datasets with dimensionalities of 5-15, and mouse embryonic heart datasets with dimensionalities of 10-30. In all cases, Monet was able to infer the correct dimensionality for the simulated datasets (Figure 2b). It should be noted that in real-world datasets where the dimensionality is not artificially truncated, there is typically no sharp transition from dimensions that capture biological expression differences to those that only capture technical noise. This is why the Poisson loss curves look flatter for the real datasets than in the simulation study.
Nevertheless, these results showed that Monet's MCV-based inference of dimensionality provided a valid and more systematic way of determining the dimensionality than the commonly used method of making a guess based on an "elbow plot" that shows the explained variance per PC.
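The basic idea behind MCV can be sketched as follows. This is a deliberately simplified, standalone illustration: a single 50/50 binomial split with no overlap correction, and PCA applied directly to the counts rather than to FT-transformed values. Monet's actual implementation uses 5 folds and the full procedure of Batson et al.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Mock UMI count matrix: 300 cells x 100 genes
counts = rng.poisson(5.0, size=(300, 100))

# Partition the observed molecules (UMIs) between a training and a test dataset
train = rng.binomial(counts, 0.5).astype(float)
test = (counts - train).astype(float)

# For each candidate dimensionality, fit PCA on the training data and evaluate
# a Poisson loss (negative log-likelihood up to constants) on the held-out data
losses = {}
for d in (1, 2, 5, 10, 20):
    pca = PCA(n_components=d).fit(train)
    mu = pca.inverse_transform(pca.transform(train))
    mu = np.clip(mu, 1e-6, None)  # predicted rates must be positive
    losses[d] = float(np.mean(mu - test * np.log(mu)))

best_d = min(losses, key=losses.get)  # dimensionality minimizing the MCV loss
```

Because the mock data here is pure Poisson noise around a flat rate, the loss curve is uninformative; on real data with cluster structure, the MCV loss decreases while added PCs capture biology and increases once they begin fitting noise.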

Monet reduces PCA overfitting by performing nearest-neighbor aggregation
PCA is a highly effective tool for reducing the dimensionality of scRNA-Seq data. In doing so, it separates biological expression differences, captured by the first few PCs, from technical noise, which is captured by higher PCs. In addition to reducing the dimensionality, PCA therefore also denoises scRNA-Seq data [3]. However, when applied to raw UMI counts, higher PCs tend to capture a small fraction of technical noise, which does not exhibit significant correlation structure. This effect can be described as overfitting, as it represents an example of a model capturing unwanted sources of variation. To reduce overfitting, Monet implements a nearest-neighbor aggregation step, which reduces overall noise levels, and thus reduces the extent to which individual PCs capture noise. I performed simulation studies to quantify the extent of this effect in datasets generated with the Chromium v3 technology, and found that the improvements appear relatively minor (Figure 2c). The PCs obtained after the aggregation step did not change the percentage of variance explained in the (noiseless) ground truth; however, they did lower the percentage of variance explained in the (noisy) simulated data by 2-3%, indicating that overfitting was reduced. A second round of nearest-neighbor aggregation did not improve the results further. Since the simulation method itself relied on PCA, these results are likely biased, and I expect the benefits in real-world applications to be somewhat larger. Additional simulation studies will be required to better quantify this effect.

Monet enables batch correction by identifying mutual nearest neighbors

As discussed, the analysis of individual scRNA-Seq datasets presents a number of statistical and computational challenges, and there is still surprisingly little consensus as to how to perform even basic tasks such as clustering [4]. However, most single-cell studies require a joint analysis of multiple datasets, for example to compare between different individuals, drug treatments, or genetic backgrounds. In addition, studies often stand to benefit from direct comparisons with previously published scRNA-Seq datasets. In all of these instances, researchers need to adopt strategies to overcome batch effects, which is a catch-all term for all sources of variation that represent technical artifacts rather than true biological expression differences. For example, strong batch effects can be expected when comparing datasets generated using different scRNA-Seq technologies. However, as the precise extent and nature of these sources of variation is typically unknown, batch correction methods generally have to rely on certain assumptions in discriminating technical from biological effects. The development and benchmarking of batch correction methods for scRNA-Seq data is an active area of investigation [16].

A straightforward and useful method for batch correction was described by Haghverdi et al. [12], who proposed to identify pairs of cells from two datasets that represent mutual nearest neighbors (MNNs). These cell pairs allow the calculation of batch correction vectors, which represent the batch effect present in a target dataset, relative to a reference. The authors reasoned that after subtracting these batch correction vectors from the cells in the target dataset, any remaining differences between the datasets would represent true biological differences. In effect, this approach assumes that some cell populations are shared between the two datasets, whereas others are unique to either the reference or the target dataset. After applying the batch correction, it should be possible to identify the populations only present in one dataset, but not the other. Monet implements a modified version of this approach using the correct_mnn() function, where the batch correction is performed in PC space, whereas the originally proposed method operates directly on gene expression values.
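The MNN idea, applied in PC space, might be sketched as follows. This is a minimal standalone illustration, not the correct_mnn() implementation: a single global correction vector is computed here, whereas published MNN methods compute smoothed, locally varying correction vectors.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mnn_correct(ref, tgt, k=10):
    """Shift `tgt` toward `ref` using mutual nearest-neighbor pairs.

    `ref` and `tgt` are cells x PCs score matrices."""
    ref_nn = NearestNeighbors(n_neighbors=k).fit(ref)
    tgt_nn = NearestNeighbors(n_neighbors=k).fit(tgt)
    # k nearest reference cells of each target cell, and vice versa
    ref_of_tgt = ref_nn.kneighbors(tgt, return_distance=False)
    tgt_of_ref = tgt_nn.kneighbors(ref, return_distance=False)
    pairs = [(r, t)
             for t in range(tgt.shape[0])
             for r in ref_of_tgt[t]
             if t in tgt_of_ref[r]]
    # Batch correction vector: average difference across all MNN pairs
    correction = np.mean([ref[r] - tgt[t] for r, t in pairs], axis=0)
    return tgt + correction

# Two mock "batches": the same population, shifted by a constant batch effect
rng = np.random.default_rng(0)
ref = rng.normal(size=(200, 5))
tgt = rng.normal(size=(200, 5)) + 3.0  # batch effect of +3 in every PC
corrected = mnn_correct(ref, tgt)
```

After correction, the target cloud is pulled back onto the reference cloud; crucially, only cells with mutual nearest neighbors (i.e., populations shared between batches) contribute to the correction estimate.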
To test Monet's implementation, I applied batch correction using a human PBMC dataset obtained using the Chromium v3 technology as the reference, and another human PBMC dataset obtained using the Chromium v2 technology as the target. The difference in technologies results in a strong batch effect, and a clear visual separation of clusters by dataset (Figure 3a, top). After applying batch correction, cells can be seen to cluster by cell type, with all clusters containing cells from both datasets (Figure 3a, bottom). An advantage of the MNN-based approach to batch correction is that it does not require the two datasets to have identical cell type compositions. To demonstrate this on an extreme example, I computationally removed all T cells from either the reference or the target dataset, and applied batch correction again. In both cases, the results for the other cell types were unaffected (Figure 3b), confirming that this approach is robust to differences in cell type composition between samples. The batch correction for these datasets, consisting of almost 20,000 cells, took approximately 47 seconds. In summary, Monet implements an effective and efficient MNN-based algorithm for batch correction in PC space.

Monet enables accurate label transfer between samples from the same tissue

A more supervised approach to overcoming batch effects is to perform clustering on a reference dataset, and to then use machine learning methods to directly transfer cluster labels to other datasets representing samples from the same tissue. The development of such label transfer methods is also a highly active area of investigation [4,17,18]. It is also an area where multiple deep learning-based approaches have been proposed [19,20].
However, as is the case for many scRNA-Seq analysis tasks, it is not clear how much methodological complexity is truly necessary to address this problem, especially in the commonly encountered scenario where a researcher simply wishes to transfer labels between samples from the same tissue. Monet implements a label transfer method that relies on training a standard k-nearest neighbor (kNN) classifier on the reference data, after projecting it into the latent space represented by a Monet model. A target dataset can then be projected into the same latent space, and labeled using the kNN classifier (Figure 4a). Monet uses a default value of K=20 for classification, which can be changed by the user. To test this approach, I used the same human PBMC reference dataset as in the batch correction example (see above), obtained using the Chromium v3 technology. I fitted a Monet model and performed clustering, identifying all major cell types present (Figure 4b). I then applied the label transfer method to two other human PBMC datasets: first, a dataset obtained using the same Chromium v3 technology, and second, the Chromium v2 dataset that was shown in Figure 3a to exhibit strong batch effects. For both datasets, I compared the label transfer results to results obtained from manual clustering (Figure 4c). In both cases, Monet correctly identified the vast majority of cells from each cell type (Figure 4d), demonstrating that the simple combination of a PCA-based latent space and a kNN classifier was largely successful in transferring cell type annotations, even in the presence of strong batch effects. The most notable exception was the failure to correctly identify approximately 12% of monocytes in the v2 dataset, suggesting that the batch effect for this cell type was too large to allow a reliable classification.
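The kNN-based label transfer scheme can be sketched as follows: a standalone toy example on synthetic PC scores, using the K=20 default mentioned in the text. Monet's actual implementation operates on real latent spaces learned from expression data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Mock reference: 300 cells in a 10-dimensional latent space, with 3 "cell
# types" simulated as well-separated Gaussian clusters
centers = np.array([-5.0, 0.0, 5.0])
ref_labels = rng.integers(0, 3, size=300)
ref_scores = rng.normal(size=(300, 10)) + centers[ref_labels, None]

# Train a standard kNN classifier on the reference PC scores
clf = KNeighborsClassifier(n_neighbors=20).fit(ref_scores, ref_labels)

# Mock target dataset drawn from the same populations, projected into the
# same latent space, then labeled with the classifier
tgt_labels = rng.integers(0, 3, size=100)
tgt_scores = rng.normal(size=(100, 10)) + centers[tgt_labels, None]
transferred = clf.predict(tgt_scores)
accuracy = (transferred == tgt_labels).mean()
```

With well-separated clusters, accuracy is essentially perfect; the interesting regime, discussed in the text, is when batch effects shift the target cells relative to the reference clusters.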
It is possible that this could be ameliorated by performing an MNN-based batch correction step (see above) before applying the label transfer method. In future work, I aim to test whether this approach indeed leads to better results when strong batch effects are present. If so, I plan to make this option directly available in Monet's label transfer function. It should be noted that the transferred labels appeared to provide a higher cell type resolution than what could be inferred from the t-SNE plot, suggesting that label transfer can help to improve cell type resolution. In summary, the kNN classification-based approach to label transfer implemented in Monet represents a simple and effective tool for transferring annotations between datasets from the same tissue.

Discussion

In this work, I have described Monet, an open-source Python package for analyzing and integrating scRNA-Seq data. Most of the analysis tasks currently supported rely on previously described methods, including visualization, clustering, denoising, and batch correction. Therefore, the primary contribution of this work does not lie in the development of a novel method for solving a particular analysis task. Rather, this work has focused on bringing together various approaches within the context of a common analysis framework, and on describing a software package that provides concrete implementations of those approaches. The design of Monet was also guided by the idea that analysis methods should not only be effective in accomplishing a particular task, but also computationally efficient (i.e., fast and not too memory-intensive), and not unnecessarily complex. It is my experience that since most analyses are exploratory in nature, algorithms that take several minutes or even hours to finish tend to disrupt the researcher's analysis workflow.
Moreover, the more complex a method is, the more difficult it is to interpret its output and to understand the relationship between raw data and analysis result, which also makes it harder to communicate research results in a transparent fashion. In contrast, methods that rely on simple data transformations and standard machine learning algorithms can be quite easy to understand, at least at an intuitive level. In summary, a fast and simple method is generally preferable to a slow and complex method, especially when it is not clear whether there is a significant difference in accuracy or effectiveness between those methods. As outlined in the Introduction, "comprehensive" software packages that provide solutions to a range of scRNA-Seq analysis tasks play a crucial role in curating and synthesizing methodological and algorithmic knowledge, and in making those methods and algorithms available to the broad community of researchers that employ scRNA-Seq technologies. The popularity of comprehensive analysis packages like Seurat, Scanpy and Monocle means that analysis frameworks and methods implemented by those packages enjoy a far broader visibility and adoption than those only implemented by more specialized packages. However, the number of successful comprehensive scRNA-Seq analysis packages is fairly small, particularly for Python users, resulting in limited diversity within this space. Monet provides a core analysis framework and set of methods that is largely distinct from those implemented by Seurat, Scanpy, and Monocle, and I therefore hope that it contributes to an increase in diversity while providing useful solutions to a number of commonly encountered analysis tasks.
While this work has focused on introducing and evaluating Monet's core analysis framework and demonstrating its ability to support various analysis tasks, the overall usefulness of the package will also depend on the availability of documentation, tutorials, and software updates. Future work will focus on developing those materials and on maintaining the Monet package.

Methods

Overview of the core analysis framework

Several key aspects of Monet's core analysis framework have been described previously [3,21]. In particular, these previous studies have described the idea of applying PCA after 1) scaling the expression profile of each cell to the median transcript (UMI) count C across all cells ("median scaling") and 2) transforming the scaled values using the Freeman-Tukey transform, y = √x + √(x + 1) (see below). In the training of a Monet model, this approach is complemented with two additional steps. First, the dimensionality D of the data is inferred using molecular cross-validation [9] (see below). Second, a nearest-neighbor aggregation step is performed to reduce overfitting [3]. Briefly, the PC scores of each cell (of the first D PCs) are used to identify its nearest neighbors, the UMI counts of each cell and its neighbors are aggregated, and PCA is then re-applied to the aggregated (and re-scaled) data. To project a new dataset into an existing latent space, let Y denote the matrix of scaled and FT-transformed expression values for m cells and p genes from the dataset to be projected. It is assumed here that the set of genes is identical to the set of genes in the training dataset. If necessary, this assumption can be satisfied by removing any unknown genes from the dataset and inserting zero measurements for any missing genes. The PC scores S for Y can then be obtained as S = YW, where W denotes the p × D matrix of PC loadings.
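The projection step S = YW can be illustrated with scikit-learn. This is a standalone sketch with mock data; note that scikit-learn's PCA centers the data, so the training mean must be subtracted before multiplying by the loadings.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Mock transformed training data: 100 cells x 30 genes, latent dimensionality D = 5
train = rng.normal(size=(100, 30))
pca = PCA(n_components=5).fit(train)

W = pca.components_.T          # W: p x D matrix of PC loadings
Y = rng.normal(size=(20, 30))  # new dataset: m = 20 cells, same p = 30 genes

# S = YW (after centering with the training mean)
S = (Y - pca.mean_) @ W
assert np.allclose(S, pca.transform(Y))  # matches scikit-learn's own projection
```

The key point is that the loadings W and the mean are taken from the training dataset, so cells from any new dataset land in the same latent space as the reference cells.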

Effect of the Freeman-Tukey transform in comparison to the log and Anscombe transforms
The effect and usefulness of the Freeman-Tukey transform on scRNA-Seq data can be appreciated by realizing that scRNA-Seq measurements are associated with significant amounts of technical noise, the exact magnitude of which is expression level-dependent. In order to describe exactly how much noise we expect to observe for a given gene in a given cell, it is helpful to think of scRNA-Seq measurements as random variables, which have specific probability distributions. Based on experiments designed to directly study the noise profile of scRNA-Seq measurements by analyzing "cells" containing identical pools of mRNA (e.g., purified mRNA diluted into droplets), we know that UMI counts can be modeled using the Poisson distribution [1,13,22-24]. Specifically, let X represent the UMI count of a particular gene in a particular cell. Then X is a Poisson-distributed random variable whose expected value λ corresponds to the true (relative) expression level of this specific gene in this particular cell [1,13,23]. The noise level of X is described by its coefficient of variation (CV), which due to the Poisson-distributed nature of X is inversely proportional to the square root of λ: CV(X) ∝ 1/√λ. The signal-to-noise ratio is the reciprocal of the coefficient of variation, and therefore directly proportional to the square root of λ: SNR(X) ∝ √λ. The Freeman-Tukey transform, defined as y = √x + √(x + 1), is a variance-stabilizing transform for Poisson-distributed data [14]. Since it is a square root-based transform, it now becomes clear that it approximately weighs each gene expression measurement according to its signal-to-noise ratio. In other words, measurements of highly expressed genes, which are relatively accurate, are given more weight than measurements of lowly expressed genes, which mostly represent technical noise and contain very little information about true expression differences.
In fact, after the Freeman-Tukey transform, measurements from lowly expressed genes only have a minimal impact (quantified as the overall proportion of variance associated with those genes). In contrast, after the log transform, y = ln(x + 1), those measurements can contribute more heavily than they should based on their signal-to-noise ratio [13]. Virtually all scRNA-Seq analysis workflows that use the log transform rely on a separate step in which the G most "informative" genes from the data are selected [2], and the fact that the log transform tends to assign too much weight to lowly expressed genes appears to be one of the reasons why such a gene selection step is necessary or recommended. The Freeman-Tukey transform, used in combination with median scaling, makes a gene selection step unnecessary. Another transform that accomplishes the same goal is the Anscombe transform, y = 2√(x + 3/8). However, this transform has a more extreme effect on the true expression differences (see below). Aside from weighing gene expression values based on their signal-to-noise ratio, the Freeman-Tukey transform of course also affects the extent to which true expression differences contribute to the analysis. Here, square root-based transforms also differ fundamentally from log-based transforms. The log function grows much more slowly than the square root function, so expression differences in highly expressed genes appear much larger after square root than after log transform, whereas true expression differences in lowly expressed genes can be almost completely lost after square root-based transforms. However, the Freeman-Tukey transform represents somewhat of a compromise between the log transform and the Anscombe transform, which assigns even more weight to expression differences between highly expressed genes than the FT transform.
It should be noted that in the untransformed data, biological variation can be dominated by only a handful of very highly expressed genes.
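The variance-stabilizing property can be checked numerically: for Poisson samples across a range of expression levels λ, the standard deviation after the Freeman-Tukey transform stays approximately constant (≈1), whereas on the raw scale it grows as √λ.

```python
import numpy as np

rng = np.random.default_rng(0)

def freeman_tukey(x):
    """Freeman-Tukey transform: y = sqrt(x) + sqrt(x + 1)."""
    return np.sqrt(x) + np.sqrt(x + 1)

for lam in (10, 50, 200):
    x = rng.poisson(lam, size=100_000)
    raw_std = x.std()                 # grows like sqrt(lam)
    ft_std = freeman_tukey(x).std()   # stays close to 1 across all lam
    print(lam, round(float(raw_std), 2), round(float(ft_std), 2))
```

This constant post-transform noise level is exactly what lets each measurement contribute to PCA roughly in proportion to its signal-to-noise ratio, without a separate gene selection step.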

Inference of dimensionality using molecular cross-validation
Monet's implementation of MCV closely follows the description provided by Batson et al. [9], and the reader may refer to their study for a detailed description of this method. Briefly, given a real scRNA-Seq dataset X with an unknown underlying ground truth X_deep, the authors describe how to create training and test datasets X' and X'' by carefully sampling from the binomial distribution. In effect, this sampling procedure partitions the observed molecules (UMIs) in X between X' and X'', while allowing for a small overlap. The authors show that if done correctly, X' and X'' represent statistically independent samples of X_deep, mimicking a theoretical scenario in which researchers had performed two independent scRNA-Seq experiments on the same sample (which of course is not possible, as scRNA-Seq destroys the cells being analyzed).

Datasets

The following datasets, published by 10x Genomics, were used in this work:

• "4k PBMCs from a Healthy Donor" (v2-PBMC-4k): Human PBMCs, data generated using the 10x Chromium v2 technology, published by 10x Genomics (https://support.10xgenomics.com/single-cell-gene-expression/datasets/2.1.0/pbmc4k).
• "10k PBMCs from a Healthy Donor (v3 chemistry)" (v3-PBMC-10k): Human PBMCs, data generated using the 10x Chromium v3 technology, published by 10x Genomics (https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_10k_v3).
• "5k Peripheral blood mononuclear cells (PBMCs) from a healthy donor with cell surface proteins (v3 chemistry)" (v3-PBMC-5k): Human PBMCs, data generated using the 10x Chromium v3 technology, published by 10x Genomics (https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.1.0/5k_pbmc_protein_v3).
• "10k Heart Cells from an E18 mouse (v3 chemistry)" (v3-Heart-10k): Mouse embryonic heart cells, data generated using the 10x Chromium v3 technology, published by 10x Genomics (https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/heart_10k_v3).
To focus on the expression of protein-coding genes and to reduce matrix size by approximately two thirds, the following gene filtering step was performed. For all datasets, the "Feature / cell matrix (filtered)" file was downloaded from the 10x Genomics website. A list of known protein-coding genes was extracted from the human Ensembl genome annotations, release 97 (http://ftp.ensembl.org/pub/release-97/gtf/homo_sapiens/Homo_sapiens.GRCh38.97.gtf.gz). Each dataset was then filtered to only retain those known protein-coding genes, identified by their Ensembl IDs. To remove low-quality cells and to remove gene expression from genes encoded on the mitochondrial genome, the following quality control steps were performed. A list of 13 protein-coding genes located on the mitochondrial genome was obtained by selecting all protein-coding genes whose names start with "MT-". For all datasets generated using the 10x Chromium v3 technology, individual cells were removed if they had fewer than 2,000 measured transcripts (UMIs), or if more than 20% of measured transcripts originated from those 13 mitochondrial genes. All datasets were then filtered to exclude those 13 genes.
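The cell quality control step described above can be sketched with pandas as follows (a standalone illustration on a mock count matrix; the UMI and mitochondrial-fraction thresholds are taken from the text):

```python
import pandas as pd

genes = ['MT-ND1', 'MT-CO1', 'ACTB', 'CD3E', 'LYZ']
# Mock genes x cells UMI count matrix: cell1 has too few UMIs, and cell2 has
# a high mitochondrial fraction
counts = pd.DataFrame(
    [[ 100, 100,  900,   50],   # MT-ND1
     [ 100,  50,  800,   40],   # MT-CO1
     [2000, 500, 1500, 2000],   # ACTB
     [1000, 300,  400, 1500],   # CD3E
     [ 800, 200,  300, 1200]],  # LYZ
    index=genes, columns=['cell0', 'cell1', 'cell2', 'cell3'])

mito_genes = [g for g in genes if g.startswith('MT-')]
total_umis = counts.sum(axis=0)
mito_frac = counts.loc[mito_genes].sum(axis=0) / total_umis

# Keep cells with >= 2,000 UMIs and <= 20% mitochondrial transcripts,
# then exclude the mitochondrial genes themselves
keep = (total_umis >= 2000) & (mito_frac <= 0.20)
filtered = counts.loc[:, keep].drop(index=mito_genes)
```

In this toy example, cell1 is removed for its low transcript count, cell2 for its high mitochondrial fraction, and the two MT- genes are dropped from the remaining matrix.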

Visualization and clustering
Visualizations and clustering analyses using t-SNE and DBSCAN were performed as previously described [21], using 50 principal components and a perplexity of 30. Briefly, the Galapagos clustering workflow was applied, consisting of median scaling, application of the Freeman-Tukey transform, and PCA. Cell type annotations were based on cell type-specific marker genes.
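The visualization and clustering workflow might be sketched as follows: a standalone scikit-learn illustration with mock data standing in for FT-transformed expression profiles, using the parameter values given in the text (50 PCs, perplexity 30). The DBSCAN eps value here is an arbitrary choice for the toy data, not Galapagos' setting.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Mock transformed expression data: two well-separated populations of cells
data = np.vstack([rng.normal(0, 1, size=(150, 60)),
                  rng.normal(6, 1, size=(150, 60))])

# PCA to 50 components, as in the text
scores = PCA(n_components=50).fit_transform(data)

# t-SNE embedding with perplexity 30, followed by DBSCAN on the 2D embedding
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(scores)
labels = DBSCAN(eps=3.0).fit_predict(embedding)  # -1 marks noise points
```

Cluster labels would then be annotated by inspecting cell type-specific marker genes within each cluster.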

Simulation of scRNA-Seq data
Simulations were performed by applying the ENHANCE denoising algorithm to a real scRNA-Seq dataset, using the result as the ground truth, and then simulating efficiency and sampling noise to obtain the simulated data [3]. This previously described approach was slightly modified: ENHANCE was applied using the dimensionality (number of PCs) inferred by Monet, rather than letting ENHANCE infer the dimensionality using its own heuristic. To validate the MCV approach implemented by Monet, the dimensionality of the ground truth was truncated to a specified number of PCs as follows. The ground truth obtained by ENHANCE can be represented by a set of PC coefficients and scores. To obtain simulated data with a clearly defined dimensionality, only the data represented by the first D_sim PCs were used as the ground truth.
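The truncation and sampling-noise steps can be sketched as follows. This is a standalone illustration that uses scikit-learn's PCA and a mock matrix in place of the ENHANCE output, and models only sampling noise (efficiency noise is omitted for brevity).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Mock "denoised" expression matrix (cells x genes) standing in for the
# ENHANCE output; values must be non-negative to serve as Poisson rates
ground_truth = rng.gamma(2.0, 2.0, size=(200, 50))

# Truncate the ground truth to its first D_sim PCs
D_sim = 5
pca = PCA(n_components=D_sim).fit(ground_truth)
truncated = pca.inverse_transform(pca.transform(ground_truth))
truncated = np.clip(truncated, 0, None)  # reconstruction can dip slightly below zero

# Simulate sampling noise by drawing UMI counts from Poisson distributions
# whose rates are given by the truncated ground truth
simulated = rng.poisson(truncated)
```

By construction, the simulated data then has exactly D_sim dimensions of true signal, which is what makes it a clean benchmark for MCV-based dimensionality inference.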