scMARK, an ‘MNIST’-like benchmark to evaluate and optimize models for unifying scRNA data

Today’s single-cell RNA analysis tools provide enormous value in enabling researchers to make sense of large single-cell RNA (scRNA) studies, yet their ability to integrate different studies at scale remains untested. Here we present a novel benchmark dataset (scMARK) that consists of 100,000 cells across 10 studies and can test how well models unify data from different scRNA studies. We also introduce a two-step framework that uses supervised models to evaluate how well unsupervised models integrate scRNA data from the 10 studies. Using this framework, we show that the Variational Autoencoder scVI represents the only tool tested that can integrate scRNA studies at scale. Overall, this work paves the way to creating large scRNA atlases and ‘off-the-shelf’ analysis tools.

Figure 1: scRNA studies are on the rise: the number of cells published per month is increasing exponentially, as per Svensson et al. [1], while the sequencing cost per cell is falling rapidly (internal analysis; key reference points used, e.g., Zhang et al. [13]).

A framework for evaluating scRNA model performance

We propose a solution to these challenges whereby (1) we assume that, on average, author labels are correct, so we can test how well models predict author labels on data from held-out studies (Fig. 2A); and (2) we assume that cell types can be easily separated in good cell-type spaces, while separation is challenging in poor cell-type spaces. Therefore, we can use a collection of simple supervised models to evaluate how good a cell-type space is by seeing how well they can label cell types in a held-out dataset (Fig. 2B & 2C).
Under these assumptions, we can identify top-performing models by (1) generating a benchmark dataset comprising stratified samples of cells from a set of author-labeled scRNA studies, with harmonized cell-type and gene labels; (2) using unsupervised scRNA analysis models to reduce gene-space to an aligned cell-type space; and then (3) evaluating the quality of this cell-type space by training a collection of supervised models to label cell types on all but one held-out study, which we use to assess performance (with cross-validation).
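The evaluation loop in step (3) can be sketched as leave-one-study-out cross-validation over an embedding. This is a minimal illustration on synthetic data, not the actual scMARK pipeline: the embedding `Z`, labels `y`, and study IDs are random stand-ins, and a single KNN classifier stands in for the full collection of supervised models.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Synthetic stand-ins: a 10-dimensional unified cell-type space Z,
# author cell-type labels y, and a study ID per cell.
Z = rng.normal(size=(300, 10))
y = rng.integers(0, 3, size=300)
study = rng.integers(0, 3, size=300)

# Leave one study out at a time; train on the rest, score on the held-out study.
scores = []
for held_out in np.unique(study):
    train, test = study != held_out, study == held_out
    clf = KNeighborsClassifier().fit(Z[train], y[train])
    scores.append(f1_score(y[test], clf.predict(Z[test]), average="macro"))
mean_f1 = float(np.mean(scores))
```

A well-aligned cell-type space yields a high mean F1 across held-out studies; a poorly aligned one does not, regardless of which simple classifier is used.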

Generation of the scMARK benchmark dataset
We unified ten scRNA studies (Table S1), taking a random 10,000-cell sample from each study. Our goal was to maintain author cell-type labels as closely as possible, while standardizing labels across studies. Our harmonization process recorded 28 cell-type labels across the studies (Fig. 3A and Table S2). We also harmonized gene naming conventions using the latest versions of the Human Protein Atlas [14] and ENSEMBL [15] databases. We kept genes present in at least 2 of the ten studies (Fig. 2B). Of note, Bi et al. 2021 reported vastly more gene identifiers than other studies, as this study also captured non-coding RNA, while Azizi et al. 2018 reported fewer, likely due to the use of the inDrop platform in this study, vs. the others that used a 10X instrument. Overall, this benchmark dataset represents the first effort to unify studies with author-specified ground-truths, as has previously been proposed as a gold standard [8], but until now not addressed owing to the harmonization challenges we solve here.
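The "present in at least 2 of the ten studies" filter amounts to counting, per harmonized gene identifier, how many studies report it. A small sketch with hypothetical gene sets (the study names and genes here are illustrative, not scMARK's actual contents):

```python
from collections import Counter

# Hypothetical per-study gene sets, after harmonizing identifiers
# to a common namespace (HPA/ENSEMBL in the real pipeline).
studies = {
    "study_a": {"CD4", "CD8A", "FOXP3", "LINC00115"},
    "study_b": {"CD4", "CD8A", "FOXP3"},
    "study_c": {"CD4", "NKG7"},
}

# Count how many studies contain each gene; keep genes seen in >= 2 studies.
counts = Counter(g for genes in studies.values() for g in genes)
kept = sorted(g for g, n in counts.items() if n >= 2)
# kept -> ['CD4', 'CD8A', 'FOXP3']
```

This keeps widely measured genes while dropping identifiers unique to a single study's capture technology (e.g., non-coding RNA reported by only one study).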

Model performance evaluation
For the unsupervised step of our evaluation framework we tested: (a) Principal Component Analysis (PCA) applied directly to all genes (PCA_all); (b) Reciprocal PCA as per Seurat using the 'SCTransform' data slot (RPCA_sct); (c) Reciprocal PCA as per Seurat using the 'integrated' data slot (RPCA_int) [16]; (d) PCA applied to a subset of genes selected as described in scmap (PCA_scmap) [17]; and (e) Variational Auto-Encoders (VAEs), implemented as per the single-cell Variational Inference models described in Lopez et al. (scVI) [18]. For PCA_all, PCA_scmap and scVI we used the log-normalized gene expression matrices from all 10 studies, whereas for RPCA_sct and RPCA_int we used the SCTransform normalization. Each unsupervised unified cell-type space was input to four cell-type classifiers: XGBoost (XGB), Gaussian Naive Bayes (GNB), K-nearest neighbors (KNN), and Support Vector Machine (SVM) models, which acted as our collection of supervised models to evaluate the cell-type space.

Figure 2: A framework for evaluating model performance at integrating scRNA datasets: A) In our evaluation framework, we use unsupervised models to reduce the complete set of single-cell gene expression matrices into a unified cell-type embedding space. We then train a collection of supervised models to predict author labels from all but one held-out dataset in this unified cell-type space, and evaluate their performance on the held-out dataset with traditional accuracy metrics, e.g., F1 score; B) We propose that supervised models find it easy to predict cell-type labels in new studies in high-scoring cell-type spaces, since cell-types are well aligned; C) In low-scoring cell-type spaces, supervised models find it challenging to label cell-types, since poor alignment makes it difficult to predict cell-type labels.
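Scoring a fixed embedding with several simple classifiers can be sketched as follows. This toy example uses synthetic blobs in place of a real unified cell-type space, and omits XGBoost to keep the sketch dependency-free; the classifier choices mirror three of the four used in the paper.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for a 10-dimensional unified cell-type space
# with four "cell types".
Z, y = make_blobs(n_samples=400, n_features=10, centers=4, random_state=0)
Z_tr, Z_te, y_tr, y_te = train_test_split(Z, y, random_state=0)

# Fit each supervised model on the same embedding and record macro-F1.
scores = {
    name: f1_score(y_te, clf.fit(Z_tr, y_tr).predict(Z_te), average="macro")
    for name, clf in [
        ("GNB", GaussianNB()),
        ("KNN", KNeighborsClassifier()),
        ("SVM", SVC()),
    ]
}
```

If the embedding aligns cell types well, all of these simple classifiers should score similarly high, which is exactly the signal the framework relies on.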
Based on this framework, we found that scVI outperformed other approaches on the scMARK benchmark dataset (Fig. 4A; Table S2). PCA_all and PCA_scmap generated results comparable to each other. Finally, RPCA_sct performed poorly on this benchmark. Importantly, we found minimal variation in the mean F1 score between supervised methods, in contrast to the unsupervised integration methods, implying that we are successfully evaluating the unsupervised model. Overall, this comparison shows clear differences in performance, validates our two-step strategy for testing unsupervised models, and shows that scVI stands out above other methods.

Behavior as dataset size increases
We next assessed how computation time and performance vary with increasing dataset size, to understand the scalability of the approaches. We increased samples from 500 cells per dataset (5k total) to 20k cells per dataset (200k total), measuring both compute time and the mean F1 score as calculated by our collection of supervised models (Fig. 4B & 4C). PCA_all and PCA_scmap completed the task in minutes, but their F1 scores did not increase. RPCA compute times increased considerably and performance dropped off. In contrast, scVI performance improved with increasing dataset size, while training time increased only linearly, up to 2 hrs for 200k cells. scVI also had reduced memory requirements, using 128 GB of memory; the same integration with RPCA_sct took >600 GB of memory. Overall, this analysis demonstrates that scVI was the only tested method able to scale and leverage larger training set sizes.
Finally, we tested model performance while increasing the number of datasets from 3, to 5, to 10. We selected datasets alphabetically, to test the effect of additional training data on performance vs. the addition of datasets that may be easier or more difficult to predict. scVI outperformed the other methods even at lower dataset numbers (Fig. 4D). At five datasets, performance plateaued. Of note, at ten datasets, 300 unique batch IDs exist. We speculate that this result leaves room for new models to improve on scVI by making better use of the additional datasets, or by better handling large batch numbers.

Characteristics of the unified atlas generated by the top-performing model
Our evaluation framework showed that scVI outperformed other methods. To qualitatively assess the results of dataset integration, we used UMAP to project the 10-dimensional embedding into a 2-dimensional space [19].
Coloring the projection by study shows a high level of overlap between all studies (Fig. 5A). Coloring by the standardized author cell-type labels shows a tight grouping between cell-types (Fig. 5B). For example, CD4+ T-cells, CD8+ T-cells, and NK cells form a large leukocyte cluster and have a common lineage; epithelial lineages fall in a similar region of the projection to each other; and fibroblasts and endothelial cells form independent groups. A critical hypothesis of our evaluation framework is that we can find new cell-types in properly unified cell-type spaces. FOXP3+ T-cells are a well-characterized CD4+ T-cell subtype, implicated in immune suppression, and known as regulatory T-cells (T-regs). Within the T-cell cluster, FOXP3+ expression localizes to a distinct region of the CD4+ T-cell space (Fig. 5C) that contains cells from all ten studies. This result shows that we could find this unlabelled cell subtype by analyzing sub-regions of the CD4+ T-cell cluster. This qualitative assessment thus highlights the value of the unified cell-type space generated by scVI.

Limitations
Finally, we sought to understand on which datasets scVI performed well vs. poorly. We generated confusion matrices for three datasets covering the spectrum of scVI integration performance, using the SVM predictions. First, we examined Azizi et al. 2018, a dataset generated on inDrop. All four unsupervised methods perform poorly on this dataset. Examination of the confusion matrix indicates the dataset is limited to immune cells (Fig. 6A). Myeloid sub-types were difficult to separate in this and other datasets (e.g., Qian et al. 2020), indicating this may be an area for improved ground-truth labeling. In contrast, NK, CD4+, and CD8+ T-cells were difficult to separate in Azizi et al. 2018 but easy elsewhere, indicating that here scVI is struggling to transfer learnings from the 10X Chromium to the inDrop platform. In the confusion matrix of Qian et al. 2020, we found models encountered issues splitting cancer cells from epithelial cells, and dendritic cells from macrophages (Fig. 6B). These cell-types are of similar lineages in both cases, suggesting better models or larger datasets may lead to small improvements. Accuracy was highest on Ma et al. 2019 (Fig. 6C); this is likely due to the high-level cell-type labels provided by the authors, showing that more detailed cell-type labels create complexity. Overall, models trained on the scVI cell-type space perform well, but there is likely room for improvement, both in model quality and in developing benchmark datasets with more granular cell-type labels.

Figure 4: Performance of models on the scMARK benchmark dataset: A) Five feature reduction approaches were compared with four cell-type classifiers. The results highlight that scVI outperformed other approaches based on all supervised models (see Table S3 for detailed results); B) Compute times for the unsupervised steps, increasing the number of cells per study from 300 to 20,000. The following virtual machines (VMs) were used: scVI, a GPU VM with 128 GB memory; PCA_all and PCA_scmap, a CPU VM with 4 vCPUs and 160 GiB memory; RPCA_sct, a CPU VM with 4 vCPUs and 976 GiB memory; C) The mean F1 score as cell-sample size increases, recorded for each pairing of unsupervised integration and supervised cell-type classification model; D) As in A, but keeping 10,000 cells per study constant while varying the number of training studies (2, 4, or 9). In all tests, the F1 score resulted from leave-one-out cross-validation on the supervised step.

Discussion
To date, most benchmarks have used individual studies to test the performance of machine-learning models for analyzing scRNA data. Yet such approaches are a low hurdle: batch effects within a study are much smaller than those between studies; thus, very high levels of performance are typically seen [4,5,20]. Model performance is also commonly tested with statistical metrics [3,21]. However, how accurately a model aligns cell-types remains an open question under these metrics, since no ground truth is used. More recent studies use ground-truth labels to test supervised cell-type labeling models on new, unseen studies [4,22,23,7]. Yet supervised models limit new cell-type discovery; worse, such models can perform well by forcing cell-types into buckets that do not represent their actual state. Collectively, these experiments highlight the need for a high-hurdle benchmark that tests models' ability to align disparate data in a real-world setting.

Figure 5: Visual inspection of the results of dataset integration with scVI: A) A UMAP projection of the 10-dimensional embedding space produced by the VAE highlights broad levels of overlap across all studies; B) The same UMAP projection, colored by standardized author cell-type labels (see Table S1 for harmonization details); C) A zoom-in of the T and NK-cell cluster shows FOXP3+ expression localizes to a subregion of CD4+ T-cells; this agrees with existing knowledge about FOXP3+ expression on regulatory T-cells, a CD4+ T-cell subset.
Here we present, to our knowledge, the first framework and dataset (scMARK) for quantitatively evaluating how well methods can unify scRNA gene-expression matrices into a unified cell-type space, in a way that avoids the pitfalls and complexities of using statistical metrics of performance. We achieve this through (1) creating a benchmark dataset to test how well models can integrate disparate data from different authors; (2) using an unsupervised scRNA model to integrate the studies into a single unified cell-type space; and (3) evaluating this space with a collection of supervised models. We show that scVI, a Variational Autoencoder optimized to manage scRNA batch-effects, outperforms other approaches. We find that scVI also represents the only tested method that benefits from larger training datasets. Qualitative assessment of the unified cell-type space indicates that the scVI embedding is suitable for automatic cell-type labeling and discovery of new cell-types. Thus we find that scVI, and VAEs more generally, represent a solution to scaling the unification of scRNA-seq data. We hope that this work to develop an MNIST-like scRNA benchmark will pave the way for the generation of larger datasets akin to ImageNet [24]. We speculate that robust models derived from larger, ImageNet-like datasets will, in turn, be critical both to the construction of tissue and disease atlases and to the use of scRNA technologies in a clinical setting.

Methods
Collection and standardization of scRNA data: Raw scRNA UMI count matrices were obtained from public repositories (Table S1), and quality control followed the original authors' filters. Gene identifiers were standardized across studies based on (i) Human Protein Atlas (HPA) versions 13 to 20; and (ii) ENSEMBL GRCh38 versions 78 to 103. Priority was given to HPA identifiers. In cases where the authors provided only general T-cell annotations, we used Azimuth [15] to assign those cells to CD8+ or CD4+ T cells.
scRNA data normalization: For scVI, PCA_all and PCA_scmap, the count matrices were normalized on a per-cell basis using Scanpy v1.7.2, by dividing each cell by its total count over all genes. The normalized count was then multiplied by a scale factor of 10,000, after which a log(X+1) transformation was applied. For RPCA, Seurat's SCTransform normalization was used with default parameters [15].
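The per-cell normalization described above (the equivalent of Scanpy's `normalize_total` followed by `log1p`) can be written directly in NumPy. The toy count matrix here is illustrative:

```python
import numpy as np

def log_normalize(counts, scale=10_000):
    """Per-cell normalization as described: divide each cell (row) by its
    total count over all genes, multiply by a scale factor, then log(X+1)."""
    totals = counts.sum(axis=1, keepdims=True)
    return np.log1p(counts / totals * scale)

# Toy UMI count matrix: 2 cells x 2 genes.
counts = np.array([[10.0, 90.0], [1.0, 3.0]])
out = log_normalize(counts)
# First cell, first gene: 10/100 * 10,000 = 1,000 -> log1p(1000)
```

In the actual pipeline this is applied to the full sparse UMI matrices via Scanpy, but the arithmetic is the same.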
Unsupervised cell-type space methods: For PCA, we experimented with two types of gene selection: PCA_scmap and PCA_all. We used the TruncatedSVD method from the scikit-learn (v0.24.2) library instead of the PCA method, because TruncatedSVD is more efficient for sparse matrices. For PCA_scmap, we applied the scmap method [17] to the normalized data to extract the top 500 genes, and ran PCA on this subset. For PCA_all, we applied PCA to the normalized matrix containing all genes.
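The TruncatedSVD-on-sparse-input pattern looks like this; the matrix below is a random sparse stand-in for a normalized expression matrix, and 10 components matches the latent dimension used elsewhere in the paper:

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Random sparse stand-in for a normalized cells x genes matrix
# (scRNA matrices are typically >90% zeros, hence TruncatedSVD over PCA,
# which would require densifying the matrix to center it).
X = sparse_random(100, 500, density=0.05, format="csr", random_state=0)

svd = TruncatedSVD(n_components=10, random_state=0)
Z = svd.fit_transform(X)  # cells x 10 reduced cell-type space
```

Unlike `sklearn.decomposition.PCA`, TruncatedSVD does not center the data, which is what lets it operate on the sparse matrix directly.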
We reimplemented the scVI variational auto-encoder described by [18]. We used sample-level batch-correction and the following hyper-parameters: both encoder and decoder had two linear layers, with 1024 nodes per layer. For both encoder and decoder, dropout regularization (with 0.1 probability of an element being zeroed) and batch normalization were used between the two hidden layers, along with the ReLU activation function. The latent space dimension was set to 10 and modeled using the Normal distribution. A zero-inflated negative binomial distribution was used to model gene counts. The Adam optimizer was used for training the VAE, with learning rate = 5e-4, weight decay = 1e-5 and eps = 0.01. A step learning-rate scheduler was used with a step size of 50 epochs and an lr factor of 0.1, and early stopping with a patience of 10 epochs. The model was trained with a batch size of 32 or 64, for a maximum of 100 epochs. The implementation was done in Python using the PyTorch (v1.7.0) library.
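One plausible reading of the encoder architecture described above is sketched below. This is not the authors' code: the exact ordering of BatchNorm/ReLU/Dropout and the layer wiring are assumptions based on the text, and the decoder and the zero-inflated negative binomial likelihood are omitted.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Sketch of the scVI-style encoder described in the text: two 1024-node
    linear layers with BatchNorm, ReLU, and Dropout(0.1) between them,
    outputting the mean and log-variance of a 10-dimensional Normal latent."""

    def __init__(self, n_genes, n_latent=10, n_hidden=1024):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(n_genes, n_hidden),
            nn.BatchNorm1d(n_hidden),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(n_hidden, n_hidden),
            nn.BatchNorm1d(n_hidden),
            nn.ReLU(),
        )
        self.mean = nn.Linear(n_hidden, n_latent)
        self.log_var = nn.Linear(n_hidden, n_latent)

    def forward(self, x):
        h = self.backbone(x)
        mu, log_var = self.mean(h), self.log_var(h)
        # Reparameterization trick: sample z ~ N(mu, sigma^2) differentiably.
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        return z, mu, log_var

# Forward pass on a random batch of 32 "cells" with 2,000 genes.
enc = Encoder(n_genes=2000)
z, mu, log_var = enc(torch.randn(32, 2000))
```

The 10-dimensional `z` is the unified cell-type space that the supervised classifiers are later trained on.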
The RPCA method was implemented in R using Seurat (v4.0.3) [16], with the ten largest samples used as references for anchor detection and the following parameters: dims=10, npcs=10, k.filter=150, k.weight=100. Samples with <150 cells were merged into a single batch with others from the same study and tissue type. For all RPCA analyses we used 4 workers, as a higher number failed to retrieve the result from the MulticoreFuture parallelization function. The outputs of the functions RunPCA (npcs=10) and RunUMAP (n.components=10), with assay="SCT" or "integrated", were used as inputs for the supervised cell-type classifiers.
Supervised cell-type classification methods: A 90/10 split of the training datasets was made for training/validation, and GridSearchCV from the scikit-learn package (v0.24.2) was used to test and select hyperparameter values (Table S4) based on the best F1 score on the validation set. For XGBoost, we used the XGBClassifier method with the multi-softmax objective function from the XGBoost (v1.4.0) Python library. The remaining cell-type classifiers (GNB, KNN and SVM) were implemented using scikit-learn, via the GaussianNB, KNeighborsClassifier, and SVC (with output probability) methods, respectively.
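The hyperparameter search can be sketched with GridSearchCV as follows. This toy example uses synthetic data and a minimal grid for a single classifier; the actual search used the grids in Table S4, and standard k-fold CV here stands in for the 90/10 train/validation split described above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for a unified cell-type space with 3 "cell types".
X, y = make_classification(n_samples=300, n_features=10, n_informative=8,
                           n_classes=3, random_state=0)

# Select the SVM regularization strength by macro-F1, as in the paper's
# F1-based model selection (grid values here are illustrative).
grid = GridSearchCV(SVC(probability=True),
                    param_grid={"C": [0.1, 1.0, 10.0]},
                    scoring="f1_macro", cv=3)
grid.fit(X, y)
best_C = grid.best_params_["C"]
```

`scoring="f1_macro"` averages F1 equally over cell-type classes, which avoids the selection being dominated by abundant cell types.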