Moana: A robust and scalable cell type classification framework for single-cell RNA-Seq data

Single-cell RNA-Seq (scRNA-Seq) enables the systematic molecular characterization of heterogeneous tissues at an unprecedented resolution and scale. However, it is currently unclear how to establish formal cell type definitions, which impedes the systematic analysis of scRNA-Seq data across experiments and studies. To address this challenge, we have developed Moana, a hierarchical machine learning framework that enables the construction of robust cell type classifiers from heterogeneous scRNA-Seq datasets. To demonstrate Moana’s capabilities, we construct cell type classifiers for human immune cells that accurately distinguish between closely related cell types in the presence of experimental perturbations and systematic differences between scRNA-Seq protocols. We show that Moana is generally applicable and scales to datasets with more than ten thousand cells, thus enabling the construction of tissue-specific cell type atlases that can be directly applied to analyze new scRNA-Seq datasets. A Python implementation of Moana can be found at https://github.com/yanailab/moana.

Single-cell RNA-Seq is a breakthrough technology with applications in diverse areas of biological research [1-3]. Thanks to exponential increases in throughput over the last years, current protocols enable the efficient processing of more than ten thousand cells in a single experiment [4]. Many studies have demonstrated the unprecedented power of scRNA-Seq to provide systematic views (sometimes referred to as "atlases") of cell type heterogeneity across organisms, tissues, and developmental stages, which in many cases has led to reports of novel cell types or states [5-8].
Despite these advances, it remains unclear how to formally define cell types based on scRNA-Seq data, a problem that is complicated by high levels of technical noise [9-12]. In the absence of formal cell type definitions, the analysis of each new scRNA-Seq dataset requires the manual assignment of cell type identities. This laborious task is affected by a large number of study-specific factors, including the choice of clustering method and associated parameters [13], the technical quality of the data, the set of potential marker genes considered for each cell type, etc. Consequently, different scRNA-Seq studies of the same tissue may rely on inconsistent cell type definitions, which makes it difficult to compare and synthesize study results. Similarly, it is difficult to conduct scRNA-Seq follow-up studies of specific cell types without an unbiased method to re-identify these cell types in a new dataset.
The definition of cell types can be properly framed in machine learning terms as a classification problem. In other words, if there existed an algorithm that could accurately and reliably predict (assign) cell type identities to cells in a new scRNA-Seq dataset, then this algorithm would embody a quantitative, expression-based definition of those cell types.
However, a generally applicable machine learning framework for scRNA-Seq cell type classification must address a formidable array of challenges: First, it must overcome the high levels of technical noise inherent to scRNA-Seq data [10-12]. Second, it must be robust to systematic differences between datasets generated with different scRNA-Seq protocols [14]. Third, it must be able to accommodate a potentially large number of cell types and closely related subtypes that are commonly present in complex tissues. Fourth, it must be capable of processing large datasets consisting of tens of thousands of cells in reasonable amounts of time, and using limited computational resources (e.g., memory). Finally, it must provide a systematic approach to assess classification performance in the absence of a ground truth, as scRNA-Seq datasets for experimentally purified subpopulations are frequently unavailable.
To address some of the challenges arising from the lack of formal cell type definitions, methods that facilitate the "alignment" and joint clustering analysis of multiple scRNA-Seq datasets have been developed [14,15]. These approaches were shown to successfully overcome batch effects, and in principle allow the cell type annotations of one dataset to serve as a reference for analyzing other scRNA-Seq datasets from the same tissue. However, as these methods lack explicit representations of cell types, they do not obviate the need to conduct a manual clustering analysis, and thus do not eliminate the risk for study-specific biases to arise. An early scRNA-Seq study applied a multinomial mixture model to study the response of specific mouse immune cell populations following exposure to lipopolysaccharide, but did not quantify classification performance [16].
A recent study proposed an algorithm termed scmap-cluster [17] that "projects" cell types across scRNA-Seq datasets by selecting informative genes and assigning cell identities to individual cells based on their correlation with average cell type expression profiles. While this represented an innovative approach for classifying scRNA-Seq data, validations were largely restricted to pancreas tissue composed of highly specialized cell types, and the ability to classify cells belonging to more closely related cell types remained unclear.
Here, we describe a new machine learning framework, termed Moana, which enables the construction of robust cell type classifiers from heterogeneous scRNA-Seq datasets. We propose a hierarchical approach to clustering and classification that can accommodate very large datasets and tissues composed of complex mixtures of cell types and subtypes. For classification, our framework relies on support-vector machines (SVMs) with a linear kernel, trained on PCA-transformed data. To accurately distinguish between closely related cell types, we leverage our previously developed kNN-smoothing algorithm [11] to effectively reduce technical noise levels. We demonstrate the capabilities of our framework by constructing and validating classifiers for human peripheral blood mononuclear cells (PBMCs), which allow the accurate prediction of immune cell types and subtypes in data generated using different scRNA-Seq protocols, as well as following experimental perturbation. We also apply our framework to construct a classifier for human pancreas cell types, demonstrating its general applicability.

A scalable framework for robust scRNA-Seq cell type classification
To design a scalable framework for constructing robust cell type classifiers from large, heterogeneous scRNA-Seq datasets, we observed that cell types in a given tissue are often closely related biologically.
For example, all subtypes of mature T cells found in peripheral blood originate from thymocytes, and can be expected to share similar transcriptomes. We therefore reasoned that a hierarchical approach would be most appropriate for cell type classification. In the example, a classifier could first generally distinguish T cells from the remaining cell types, and then further distinguish among the T cell subtypes. Importantly, this would allow the two classifiers to operate at different levels of resolution, and to utilize distinct sets of genes for classification, which would be more difficult to accomplish using a single classifier. With this hierarchical approach in mind, we designed effective and integrated methods for clustering, classification, and validation, which together comprise the Moana framework (Figure 1a).
Training of a machine learning classifier requires a dataset in which expression profiles are associated with cell type labels. As such labels are typically unavailable, it is necessary to first perform clustering with the goal of labeling each cell with a cell type or subpopulation. Given the high levels of technical noise, clustering of scRNA-Seq data can present a difficult challenge, for which numerous methods have been proposed [13]. For the purposes of our framework, we designed a new clustering method (Figure 1b) that takes advantage of our hierarchical approach by partitioning the cells into a few clearly distinct subpopulations, instead of resolving all cell types simultaneously. First, we aggressively smooth the data using our previously developed kNN-smoothing algorithm [11], thereby removing as much technical noise as possible. We then perform clustering directly in two-dimensional PC space using DBSCAN, a standard clustering algorithm [18]. This allows a direct visual inspection of clustering results, and an examination of the expression patterns of known cell type markers (Figure 1c).
Subsequently, additional subtypes can be identified by repeating the procedure for each subpopulation (Figure 1d).
Our clustering results suggested that with sufficient smoothing, even closely related cell types would become linearly separable in principal component space. However, overly aggressive smoothing can also introduce artifacts. In our framework, we therefore predict cell types using linear support vector machines (SVMs), trained on minimally smoothed and PCA-transformed data (Figure 1e). Each principal component captures an expression module consisting of many genes, thereby reducing the impact of any single gene on the classification outcome. Additionally, SVMs rely on the principle of maximum margin classification to achieve optimal classification performance [19]. Our classification model is therefore designed to exhibit maximal robustness with respect to protocol-related differences or unforeseen perturbations to the transcriptome. The "chaining" of individual classification models results in a hierarchical classifier that is able to accommodate a large number of cell types and subtypes in a given tissue (Figure 1f).
Datasets consisting of more than 10,000 cells present computational challenges related to the time and memory required to complete each of the analysis steps. We found that our hierarchical approach to constructing cell type classifiers allowed for a straightforward solution to this problem (Figure 1g). First, we generate a training dataset by randomly sampling a small subset of cells from the entire dataset. Clustering and training of a Moana classification model is performed on this training dataset, and the resulting classifier is then used to predict the cell types in the entire dataset.
The procedure can then be repeated for each subpopulation, which allows the construction of a complete hierarchical classifier without ever having to analyze the full dataset. A detailed technical description of all components of the Moana framework is provided in the Methods.
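The subsample, train, and predict strategy described in this section can be illustrated with scikit-learn. The following is a minimal sketch under stated assumptions, not the Moana implementation: the function names (`normalize`, `train_on_subset`), the fixed scaling target, the Freeman-Tukey-style transform used as the variance-stabilizing step, the number of principal components, and the use of `SVC(kernel="linear")` are all illustrative choices. The labels that Moana would obtain by clustering the training subset are replaced here by simulated labels.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC


def normalize(counts, target=1000.0):
    """Scale each cell (row) to a fixed total, then apply a
    Freeman-Tukey-style variance-stabilizing transform (assumed choice).
    Cells with zero total counts are not handled in this sketch."""
    totals = counts.sum(axis=1, keepdims=True)
    scaled = counts * (target / totals)
    return np.sqrt(scaled) + np.sqrt(scaled + 1.0)


def train_on_subset(counts, subset_idx, subset_labels, n_pcs=10):
    """Fit a PCA + linear SVM model using only the cells in subset_idx."""
    model = make_pipeline(PCA(n_components=n_pcs), SVC(kernel="linear"))
    model.fit(normalize(counts[subset_idx]), subset_labels)
    return model


# Simulated data: two hypothetical populations with distinct marker genes.
rng = np.random.default_rng(0)
n_cells, n_genes = 5000, 200
true_type = rng.integers(0, 2, size=n_cells)
mean_expr = np.ones((2, n_genes))
mean_expr[0, :20] = 5.0      # markers of population 0
mean_expr[1, 20:40] = 5.0    # markers of population 1
counts = rng.poisson(mean_expr[true_type]).astype(float)

# Train on a small random subset, then predict every cell in the dataset.
subset_idx = rng.choice(n_cells, size=500, replace=False)
model = train_on_subset(counts, subset_idx, true_type[subset_idx])
predicted = model.predict(normalize(counts))
accuracy = float(np.mean(predicted == true_type))
```

In a hierarchical setting, the same call would be repeated on the cells predicted to belong to each subpopulation, so that clustering and SVM training only ever operate on a subset of cells.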

Construction and initial validation of a cell type classifier for human PBMCs
We applied our framework to construct a cell type classifier for human peripheral blood mononuclear cells (PBMCs), which present a difficult classification challenge owing to the relatively low RNA content of these cells (Supplementary Figure 1), and the fact that several cell types such as CD14+ and CD16+ monocytes are closely related, and are therefore expected to share similar transcriptomes. Using a PBMC dataset provided by 10x Genomics (PBMC-8k, n=8,381 cells), we constructed a Moana classifier comprising eight binary classification models, thus distinguishing between nine immune cell types: CD14+ monocytes, CD16+ monocytes, dendritic cells (DCs), B cells, NK cells, and four subtypes of T cells (Figure 2a and Methods). Our clustering method successfully identified subpopulations in the clustering steps required for the construction of this classifier (Supplementary Figures 2 and 3), demonstrating its effectiveness at different levels of resolution. While broadly distinguishing between T/NK cells, B cells, and Monocytes/DCs did not require any smoothing (k=1), smoothing was essential to accurately distinguish between most cell types in the training data (Figure 2a and Supplementary Figure 4a).
To validate our classifier, we obtained a second 10x Genomics dataset (PBMC-4k, n=4,340; cells obtained from the same donor as for the training dataset). We clustered the cells into the same nine cell types as before (Supplementary Figure 5), and manually excluded 37 cells that we identified as plasmacytoid dendritic cells (Supplementary Figure 4c). We then applied our PBMC classifier to this dataset, and used the clustering results as the "ground truth" for assessing the accuracy of the cell type predictions (Figure 2b). Prediction accuracies were generally high, with precision scores of above 98% for CD14+ monocytes, dendritic cells, B cells, and NK cells. Approximately 20% of cells predicted to be CD16+ monocytes were annotated as CD14+ monocytes during clustering. CD14+ and CD16+ monocytes did not form distinct clusters in t-SNE space, suggesting that the difficulty in accurately distinguishing between them could be explained by their high transcriptomic similarity. While precision scores for T cell subtypes were slightly lower, ranging between 80-90%, this was still a remarkable result given that the naïve CD4+ and CD8+ T cell populations were completely overlapping in t-SNE space (Figure 2b). We used the predicted cell types to calculate average expression profiles for each cell type, which revealed near-perfect correlations between the training and the validation data (Figure 2c and Supplementary Figure 4d). These results demonstrated the ability of our classifier to accurately distinguish between closely related immune cell types in a validation dataset.
Our approach of combining PCA with linear SVM classification allowed us to quantify how informative each gene was for distinguishing between individual subpopulations (Figure 2f and Methods), which fundamentally depends on its cell type specificity as well as its absolute expression level. We collected the most informative genes from all subclassifiers and visualized them as a heatmap (Figure 2g and Supplementary Figure 6), which showed that these included both widely expressed genes such as B2M (Beta-2-microglobulin), as well as genes with highly cell type-specific expression such as S100A8 and S100A9 (expressed only in CD14+ monocytes and dendritic cells). This demonstrated that by choosing a hierarchical approach to cell type classification, we were able to utilize information from many more genes than if we had tried to construct a classifier based solely on genes with highly cell type-specific expression patterns.
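The reason a PCA plus linear SVM model permits gene-level interpretation is that both steps are linear, so the SVM weight vector in PC space can be projected back onto genes. The sketch below illustrates only this linear-algebra identity. It is not the informativeness measure defined in the Methods, and the function name `gene_level_weights`, the random data, and the stand-in values for the SVM weights and intercept are hypothetical.

```python
import numpy as np


def gene_level_weights(components, svm_weights):
    """Project linear SVM weights from PC space back to gene space.

    components:  (n_pcs, n_genes) PCA loadings, one principal component per row
    svm_weights: (n_pcs,) weights of a linear SVM trained on PC scores
    """
    return svm_weights @ components


rng = np.random.default_rng(1)
n_cells, n_genes, n_pcs = 100, 30, 5
X = rng.normal(size=(n_cells, n_genes))
mean = X.mean(axis=0)

# PCA via SVD of the centered matrix; rows of `components` are orthonormal.
_, _, vt = np.linalg.svd(X - mean, full_matrices=False)
components = vt[:n_pcs]

w = rng.normal(size=n_pcs)  # stand-in for trained SVM weights (hypothetical)
b = 0.3                     # stand-in for the SVM intercept (hypothetical)

# Decision values computed in PC space ...
scores_pc = ((X - mean) @ components.T) @ w + b
# ... equal decision values computed directly from genes.
gene_w = gene_level_weights(components, w)
scores_gene = (X - mean) @ gene_w + b

# Genes ordered by the magnitude of their weight in the decision function.
ranking = np.argsort(-np.abs(gene_w))
```

Because a gene's contribution to the decision value is its weight multiplied by its (centered) expression, a gene can be influential either through a large weight or through a high absolute expression level, which is consistent with the dependence on both specificity and expression level noted above.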

Systematic assessment of classification performance in the absence of a ground truth
The generation of a "ground truth" through manual clustering can be time-consuming and carries the risk of introducing subjective biases or mistakes, which can result in the over- or underestimation of classification accuracies. Unfortunately, a more reliable ground truth, such as scRNA-Seq data from experimentally purified subpopulations, is frequently unavailable. To overcome this limitation, we devised an algorithm which we refer to as mirror validation (Figure 2c and Methods). In this approach, we use the trained classifier to predict cell types in a validation dataset from the same tissue, and then use the predicted cell type labels to train a new "mirror classifier" on this dataset. We then apply the mirror classifier to the original training dataset, and assess the degree to which the predictions from the mirror classifier agree with those of the original classifier. We reasoned that whenever the original classifier fails to accurately identify its cell types in the validation dataset, it would not be possible for the mirror classifier to accurately recapitulate the original cell type assignments in the training dataset. This lack of "coherence" would therefore indicate a failure to accurately determine cell types in the validation dataset.
We applied mirror validation using PBMC-4k as the validation dataset, and found that coherence scores, which combine precision and recall values into a single measure (Methods), were at or above 90% for all cell types (Figure 2e), suggesting high classification accuracies. By examining the validation results for individual classification models, we observed that the validation accuracies for highly imbalanced classes (e.g., T cells and NK cells) could be further improved to over 98% by lowering the ν parameter of the mirror classifier (Supplementary Figure 4b).
However, independent of the choice of ν, the accuracies for the classifiers distinguishing monocyte and T cell subtypes remained slightly lower than for the other cell types. Therefore, the mirror validation results closely agreed with the results obtained using the manually generated "ground truth", demonstrating that mirror validation allowed the effective assessment of classification performance in the absence of a ground truth.

Accurate prediction of PBMC cell types in data generated using a different scRNA-Seq protocol
Given the diversity of scRNA-Seq protocols developed [4], scRNA-Seq classifiers should ideally be able to accurately determine cell types in data generated using a different protocol, which may result in substantial systematic expression differences. We therefore aimed to assess the ability of our classifier to accurately predict PBMC cell types for data generated using 10x Genomics' "v1" chemistry, as opposed to the "v2" chemistry used to generate PBMC-8k. A t-SNE analysis of a combined dataset with both v1 and v2 PBMC expression profiles showed that even after normalization, cells clustered by dataset, not cell type, suggesting the presence of large systematic differences that we did not observe between datasets generated using the same protocol (Figure 3a and Supplementary Figure 7d). When we applied our classifier to the v1 PBMC dataset (v1-PBMC-16k, n=16,000), cells of all cell types were identified, although cell type proportions differed significantly from the v2 training data (Figure 3b and Supplementary Figure 7a). Using the predicted cell types, we calculated average expression profiles for each cell type, and found that the protocol-dependent differences indeed appeared larger than many cell type differences (Figure 3b and Supplementary Figure 7b, c).
The availability of v1 datasets of purified PBMC subpopulations [20] allowed us to use an experimentally generated ground truth to assess the accuracy of our classifier. The classifier identified B cells, T cells and NK cells with >97% accuracy, but predicted the presence of 19% dendritic cells in a dataset representing purified CD14+ monocytes (Figure 3c). We did not explore the extent to which this could reflect an experimental contamination associated with the negative selection protocol used by Zheng et al. [20], as opposed to a classification error. We next applied the classifier to data from experimentally purified T cell subpopulations, and found that more than 99% of total CD8+ T cells were predicted to belong to the CD8+ lineage, while 85% of total CD4+ T cells were predicted to belong to the CD4+ lineage (Figure 3d). In addition, we found that classification accuracies ranged from 94-99% for naïve CD4+, naïve CD8+, and memory CD4+ T cells (Figure 3e). Overall, these results demonstrated the ability of our classifier to accurately distinguish between closely related cell types in data exhibiting substantial systematic differences associated with the use of a different scRNA-Seq protocol.
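The mirror validation procedure introduced in the preceding section, which applies equally when the validation data come from a different protocol, can be sketched as follows. This is an illustration under stated assumptions rather than the method as specified in the Methods: a simple nearest-centroid classifier (`CentroidClassifier`, hypothetical) stands in for the PCA plus linear SVM model, and the coherence score is assumed to be the per-class F1 score, since the text states only that it combines precision and recall.

```python
import numpy as np


class CentroidClassifier:
    """Nearest-centroid stand-in for a trained classification model."""

    def fit(self, X, y):
        self.labels_ = np.unique(y)
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in self.labels_])
        return self

    def predict(self, X):
        dist = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(axis=2)
        return self.labels_[dist.argmin(axis=1)]


def per_class_f1(reference, predicted):
    """Per-class F1 score of `predicted` relative to `reference` (assumed measure)."""
    scores = {}
    for label in np.unique(reference):
        tp = np.sum((reference == label) & (predicted == label))
        n_pred = np.sum(predicted == label)
        n_ref = np.sum(reference == label)
        precision = tp / n_pred if n_pred > 0 else 0.0
        recall = tp / n_ref
        denom = precision + recall
        scores[label.item()] = float(2 * precision * recall / denom) if denom > 0 else 0.0
    return scores


def mirror_validation(original, make_classifier, X_train, X_valid):
    """Train a mirror classifier on predicted validation labels and
    compare its predictions on the training data with the original's."""
    original_on_train = original.predict(X_train)
    predicted_valid = original.predict(X_valid)
    mirror = make_classifier().fit(X_valid, predicted_valid)
    mirror_on_train = mirror.predict(X_train)
    return per_class_f1(original_on_train, mirror_on_train)


# Simulated example: two populations, validation data with a systematic shift.
rng = np.random.default_rng(2)
centers = np.array([[0.0, 0.0], [6.0, 6.0]])
y_train = rng.integers(0, 2, size=300)
X_train = centers[y_train] + rng.normal(size=(300, 2))
X_valid = centers[rng.integers(0, 2, size=200)] + rng.normal(size=(200, 2)) + 0.5

original = CentroidClassifier().fit(X_train, y_train)
coherence = mirror_validation(original, CentroidClassifier, X_train, X_valid)
```

No ground-truth labels for the validation data enter the computation; only the agreement between the original and mirror classifiers on the training data is measured.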

Moana outperforms a previously described method for scRNA-Seq cell type prediction
To directly compare our classification results to those obtained with a previously proposed classification method, we reimplemented the scmap-cluster method [17], a non-hierarchical approach that relies on gene selection and correlation measures to predict cell type identities. We first ensured that we were able to accurately reproduce the classification results from the original study (Supplementary Figure 8). We then trained an scmap-cluster classifier on the PBMC-8k dataset, while using the PBMC-4k dataset to determine the optimal number of genes (Methods). We then applied this classifier to the experimentally purified PBMC subpopulations. We found that B and NK cells were predicted accurately, while accuracies were much lower for all other cell types (Supplementary Figure 9). In particular, for all purified T cell subpopulations, the classifier left between 28% and 70% of cells "unassigned", indicating an inability to conclusively determine cell type identities. These results showed that the scmap-cluster classifier failed to achieve a similar resolution as our Moana classifier when presented with data that exhibited substantial batch effects.

Construction and validation of a classifier for human pancreas cells
Moana is a general framework for constructing cell type classifiers for complex tissues based on scRNA-Seq data. To demonstrate its applicability beyond PBMC samples, we constructed a human pancreas cell type classifier based on scRNA-Seq data reported by our lab (Baron16-3, n=4,591; Methods) [21]. We visually confirmed the presence of batch effects, as well as the ability of our classifier to predict cell types independent of those effects (Figure 4a). Cell type composition was highly variable across cell types (Figure 4b), but mirror validation showed that cell type predictions were highly coherent, thereby validating the accuracies of our predictions (Figure 4c). Previous work relied on well-established marker genes to identify individual cell types, and we found that our predictions allowed us to recover those marker genes in an unbiased fashion (Figure 4d and Supplementary Figure 10a), further confirming our classification results. Finally, since some of these marker genes exhibited very high expression levels, we asked if it would be possible to predict cell type identities based on only a single marker gene for each cell type. We found that this was the case for the highly specialized endocrine cell types, but not for stellate and ductal cells (Figure 4e). In contrast, among PBMCs, we were only able to predict B cells using a single marker gene (Supplementary Figure 10b,c). Our results confirmed the general applicability of our framework, and showed that while distinguishing between the endocrine cell types was indeed a trivial task, our classifier also enabled the accurate prediction of the remaining cell types in the pancreas.

In addition to exhibiting robustness to batch effects, an scRNA-Seq cell type classifier should also be able to identify cell types following in vivo or in vitro perturbations to the "normal" cell state.
To test whether a Moana classifier would be able to accurately identify cell types following an experimental perturbation, we decided to predict cell type identities for PBMCs following exposure to the cytokine interferon beta (IFN-β) [22]. This treatment was reported to result in widespread transcriptomic changes across cell types and the downregulation of key cell type markers [15,23], thus providing an interesting test case for the robustness of Moana classifiers. We first used mirror validation to test the performance of our original PBMC classifier on the unexposed control dataset (Kang18-Ctl, n=12,757), which indicated that it did not accurately distinguish between CD14+ and CD16+ monocytes, nor between T cell subtypes (Supplementary Figure 11a). We observed that the unexposed cells exhibited dramatic transcriptomic differences relative to our original training data (Supplementary Figure 11c), potentially resulting from their culturing for the duration of the experiment [22]. Evidently, these differences were too great to allow an accurate classification of subtypes using our original PBMC classifier. We therefore decided to use the control dataset to train a new PBMC classifier (Supplementary Figure 12 and Methods). The structure of the classifier largely mirrored that of our first PBMC classifier; however, we were unable to accurately distinguish T cell subtypes using our clustering method, perhaps owing to the significantly lower transcript counts in this dataset (Supplementary Figure 1a). We then applied this classifier to predict the cell types in the treatment dataset (Kang18-Tx, n=13,551).
We visually confirmed the presence of significant expression differences between treated and untreated cells (Figure 5a), and found that our classifier identified cells from all major cell types with similar proportions in both conditions, in agreement with previous analyses of the same data [15,22] (Figure 5a,b). We next used the treatment dataset to perform mirror validation, which resulted in high coherence scores for all cell types except for dendritic cells (Figure 5c and Supplementary Figure 11b). We also compared our cell type predictions to the cell type assignments from a previous analysis by Butler et al., who performed clustering after "alignment" of the two datasets [15], and found that for both datasets, our predictions strongly agreed with these annotations, with the exception of dendritic cell predictions in the treatment dataset (Supplementary Figure 11d). As dendritic cells represented only 2% of cells in the training data, it is possible that additional parameter tuning was necessary to achieve better classification robustness for these cells. Overall, these results demonstrated the ability of a Moana classifier to accurately predict cell types following an experimental perturbation that induced widespread transcriptomic differences.

Expression analysis based on predicted PBMC cell types suggests a broadly shared response to IFN-β
Finally, we aimed to use the predicted cell type identities as the basis for a cell type-specific analysis of differential expression following exposure to IFN-β. A previous analysis that applied clustering to "aligned" datasets had reported a number of genes that were strongly induced across all cell types, as well as other genes with more cell type-specific induction [15]. We defined a "physiological" expression level for each gene as the maximum expression level across all cell types in the control, and then compared the expression levels in the treatment condition to this level. Our results suggested the presence of at least two distinct responses to IFN-β (Figure 5d): First, a shared response that involved a strong upregulation of more than 50 genes in all cell types, and second, a similarly strong dendritic cell-specific response. Since dendritic cells (DCs) represented a rare population and we did not distinguish between plasmacytoid and myeloid DCs, we decided to focus our analysis on the shared response. Unexpectedly, we found that in addition to genes previously reported as universal IFN-β markers, this response encompassed several genes such as CXCL11, CCL8 and CMPK2 that were previously reported as having a cell type-specific response [15] (Figure 5d, e). As the analysis was performed on the same dataset, using closely matching cell type assignments (Supplementary Figure 11d), this discrepancy likely resulted from different approaches to quantifying differential expression. Overall, these results demonstrated that Moana cell type predictions allowed the systematic quantification of cell type-specific differential expression following exposure to IFN-β, and suggested that independent of cell type, PBMCs elicit a common response that results in the strong upregulation of a substantial number of genes.
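The comparison against a "physiological" expression level described above can be made concrete with a small numerical sketch. The function names, the pseudocount, and the fold-change threshold used to call a response "shared" are assumptions introduced for illustration. The toy values are invented and do not correspond to any gene in the study.

```python
import numpy as np


def relative_to_physiological(ctrl_means, tx_means, pseudocount=1.0):
    """Fold change of treated expression over the per-gene maximum
    across control cell types.

    ctrl_means, tx_means: (n_cell_types, n_genes) average expression
    per predicted cell type in the control and treatment conditions.
    """
    physiological = ctrl_means.max(axis=0)  # one reference level per gene
    return (tx_means + pseudocount) / (physiological + pseudocount)


def shared_response_genes(fold, threshold=4.0):
    """Indices of genes exceeding the threshold in every cell type."""
    return np.flatnonzero((fold >= threshold).all(axis=0))


# Toy example: 2 cell types (rows) x 3 genes (columns).
ctrl = np.array([[1.0, 10.0, 0.0],
                 [3.0,  2.0, 0.0]])
tx = np.array([[31.0, 12.0, 0.0],
               [19.0,  1.0, 9.0]])

fold = relative_to_physiological(ctrl, tx)
shared = shared_response_genes(fold)
```

In this toy example, gene 0 exceeds its highest control level in both cell types and is therefore reported as shared, whereas gene 2 is induced in only one cell type. Using the maximum across control cell types as the reference means that a gene counts as induced in a cell type only if it rises above the highest level seen in any untreated cell type.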

In this work we have introduced Moana, a generally applicable machine learning framework that enables the construction and deployment of accurate and robust cell type classifiers for high-throughput scRNA-Seq data. We applied this framework to construct PBMC classifiers, which we were able to show allowed the accurate prediction of cell types in data generated using different protocols, as well as following perturbations of the transcriptome. To our knowledge, this is the first demonstration of an accurate and robust scRNA-Seq cell type classifier for PBMCs, and we propose that future benchmarks of scRNA-Seq classification performance should focus on difficult challenges presented by PBMCs and other tissues that contain small- to moderately-sized cells and closely related cell types. During the preparation of this manuscript, Alquicira-Hernández et al. [24] proposed a classification framework that shares some basic building blocks with Moana, in that it involves the training of SVM classifiers on PCA-transformed data. Despite substantial differences in methodology, we think that the independent proposal of combining PCA with SVM for scRNA-Seq cell type classification speaks to the intuitive appeal of this approach. However, the scope of the current work significantly exceeds that of previous reports [17,24]. We describe a hierarchical framework that encompasses methods for clustering, classification and validation, all of which directly integrate a specialized smoothing algorithm to reduce technical noise levels in scRNA-Seq data [11]. Moreover, we show that our framework can be used to construct classifiers that exhibit high prediction accuracies for a tissue composed of multiple closely related cell types, even in the presence of strong batch effects.
Our hierarchical approach to clustering and classification represents a key innovation, as it simultaneously enables our framework to scale to large datasets, accommodate a large number of cell types, and perform clustering and classification at distinct levels of resolution. Another key advantage of this approach is the ability to construct individual subclassifiers using different datasets. For example, a subclassifier for T cell subtypes could be trained on a different scRNA-Seq dataset containing only T cells. By collecting scRNA-Seq data after enriching for specific subpopulations, it is thus possible to experimentally overcome the computational difficulty of training (sub-)classifiers for extremely rare populations of cells. In this work, we have proposed to remedy the often widely differing cell type abundances by creating synthetic training datasets in which rare cell types are overrepresented, which helps to increase classification robustness as long as a certain minimum number of cells of a particular type are available.
In addition to extremely rare subpopulations, we decided to ignore the presence of doublets (multiplets) in the data. The ability of PCA to capture specific sources of variation in individual PCs allowed us to construct accurate classifiers without removing doublets. However, their identification is important to avoid confusion during the clustering stage, and we would ultimately like to incorporate an automated method for removing these technical artifacts from the data in our framework. We briefly considered modeling doublets as artificial cell types; however, this approach did not seem appropriate given the fact that the transcriptomic composition of the doublets depends entirely on the actual cell type composition of the sample. We would instead favor a simulation-based approach such as the one proposed by McGinnis et al. [25]. Additionally, to further increase the ability to accurately distinguish between closely related cell types or states, future research may be directed at the selection of informative principal components [24], and at the incorporation of a gene selection step [26].

A description of all methods is contained in the Supplement.

Figure 1: Overview of the Moana framework. a Schematic of framework components with input and output data (VST=variance-stabilizing transformation). b Clustering example using 2,000 cells from the PBMC-8k dataset (see text for details). c Annotation of cell types proceeds by identifying known markers among the most overexpressed genes for each cluster (top left). As an example, expression patterns of the CD3D, LYZ, and CD79A genes are shown (bottom left), which are known to be highly expressed in T cells, monocytes, and B cells, respectively. d Identification of additional subpopulations is accomplished by repeating the clustering procedure on cells from each cluster. e Schematic of the internal structure of the Moana classification model. f Schematic of the internal structure of a Moana cell type classifier, consisting of a hierarchy of independent classification models. g Strategy for efficiently performing clustering and classification for large datasets. A classifier is trained on a random subset of approx. 2,000 cells, and used to predict the remaining cells in the dataset. The procedure is repeated recursively for each subpopulation, enabling the construction of a complete hierarchical classifier without ever having to perform clustering or SVM training on the entire dataset.

[...] (Figure 2a). b Top: Comparison of average naïve CD4+ and naïve CD8+ T cell expression profiles in the v1 dataset. Bottom: Comparison of average naïve CD4+ T cell expression profiles in the v1 and v2 datasets. c-e Prediction results of the PBMC classifier, for v1 datasets representing experimentally purified PBMC subpopulations (Methods). The "Rest" category encompasses any cell type not listed in the legend, as well as cell types with a predicted abundance of less than one percent. f Mirror validation results for the PBMC classifier, using v1-PBMC-16k as the validation dataset.

Heatmap showing cell type-specific gene expression levels after IFN-β exposure, relative to their maximum expression level across cell types in the control condition. Right: Zoomed-in view of 15 genes with very strong upregulation. Underlined are genes reported as either universal (red) or cell type-specific (blue) IFN-β response markers by Butler et al. [15]. e Detailed view of expression levels with and without IFN-β exposure for three "universal" (top) and three "cell type-specific" (bottom) marker genes.