PCQC: Selecting optimal principal components for identifying clusters with highly imbalanced class sizes in single-cell RNA-seq data

Summary Prior to identifying clusters in single cell gene expression experiments, selecting the top principal components is a critical step for filtering out noise in the data set. Identifying these top principal components typically focuses on the total variance explained, and principal components that explain small clusters from rare populations will not necessarily capture a large percentage of variance in the data. We present a computationally efficient alternative for identifying the optimal principal components based on the tails of the distribution of variance explained for each observation. We then evaluate the efficacy of our approach in three different single cell RNA-sequencing data sets and find that our method matches, or outperforms, other selection criteria that are typically employed in the literature. Availability and implementation pcqc is written in Python and available at github.com/RoussosLab/pcqc


Introduction
In large data sets, identifying hidden clusters of observations is especially challenging due to the prevalence of noisy features. Dimensionality reduction techniques, such as principal component analysis, circumvent this issue by forming composite measures of only the most essential features, allowing researchers to more readily discern meaningful trends within data. We typically select top features by retaining only those that capture the most variance across all observations. This approach, however, limits our ability to derive information from less abundant functional clusters, such as rare populations of cells associated with specific disease [19]. We propose a new methodology to better capture small clusters by selecting candidate principal components based on the distribution of the variance explained across all observations. Existing Methodologies. Since each principal component captures a certain percentage of the variation, we can graphically represent this relationship in the form of a scree plot. To decide the cut-off point, we look for the location of the elbow in the plot, where the amount of information captured by the principal component appears to  level off. Given their conceptual simplicity and computational efficiency, scree plots are commonly used for identifying the top principal components in single-cell RNA sequencing data [5,13]. However, visually determining this cut-off is often challenging as there can be many potential cut-off points. Alternatively, we can assess the significance of the principal components through a permutation test, where we permute the entries in the columns of our data matrix to construct a randomized data set and then evaluate whether a given principal component in the original data set captures more variance than its analogue in the permuted data. Permutation tests require significantly more computational resources than their scree plot counterparts. In addition, ranking the top principal components based on total variance captured runs the risk of ignoring principal components that play an important role in describing data points that belong to smaller clusters.
PCQC: To address the aforementioned issue, we propose the PCQC (Principal Component Quantile Check) criteria, a computationally efficient methodology, where we rank the principal components by extracting a value near the tail of the distribution of variance explained across all observations, as opposed to the total variance explained (Figure 1a,1b). Intuitively, while principal components describing small clusters may not explain much variance in the data as a whole, we anticipate that for a small subgroup of data points, the variance explained is very

Data Sets
To assess the performance of the PCQC method, we consider three different single-cell RNA sequencing data sets.
We used existing and in-house generated multimodal single cell gene expression data to validate the PCQC method.
Fresh PBMCs were subjected to Cellular Indexing of Transcriptomes and Epitopes by Sequencing (CITE-seq) (PMID: 28759029) with single-cell libraries generated using the 10x platform (10x Genomics). This approach permits study of the transciptomes of individual cells while gaining quantitative and qualitative information on a panel of cell surface proteins (CD11b, CD11c, CD123, CD14, CD16, CD27, CD3, CD4, CD40, CD45, CD45RA, CD56, CD8 and HLA_DR) through the use of barcoded antibodies. Additionally, we assess the efficacy of our approach on a CBMC data set produced using CITE-seq and a PBMC data set generated through single-cell RNA-seq [2,7]. We construct our ground truth labels, by clustering the smaller dimensional antibody derived tag (ADT) data from our CITE-seq experiments or by leveraging marker genes to label different cell types based on prior studies. To better understand how dimensionality reduction impacts downstream clustering analysis, we also generate a bootstraped variation of the dataset, where we re-sample the data based on empirically observed distributions of frequencies of different cell types.
We preprocessed data based on common practices and filtered out doublets or low expression single cell barcodes [3,5,8,18]. Based on the recommended procedure from the Seurat and Scanpy pipelines, we cluster the transcript count data by embedding the data in a k-nearest neighbor graph and extract the hidden clusters using a Louvain or Leiden community detection algorithm. Subsequently, we train a multinomial Bayes model to map the unsupervised clustering to the ground-truth labels.

Results
From Table 1, we observe that PCQC consistently matches, or outperforms, alternative approaches, including Scree Plots and Permutation Tests. For example, when classifying the rare dendritic cell type, the PCQC approach decreases the median log-loss score by more than 13%.