A monotonicity-based gene clustering algorithm for enhancing clarity in single-cell RNA sequencing data

Single-cell RNA sequencing (scRNA-seq) technologies and analysis tools have allowed for meaningful insight into the roles and relationships of cells. However, high dimensionality, frequent dropout values, and technical noise remain prevalent challenges for scRNA-seq data, obscuring the already complex expression patterns. To address several shortcomings in commonly used distance metrics, we present a monotonicity-based distance metric designed to enhance the clarity of scRNA-seq data. We apply our metric in a gene clustering algorithm, which we run on several biological datasets. We compare our results to those generated by popular clustering algorithms to demonstrate that our algorithm has substantial ability to improve the accuracy of subsequent cell clustering.


Background
A scRNA-seq dataset consists of an expression matrix of genes by cells, measuring the degree to which each gene is expressed in each cell. Many computational methods have been developed for using scRNA-seq data to identify subpopulations of cells [1,2,3]. Due to the high dimensionality of scRNA-seq data and the predominance of dropout values, the data tend to be sparse, making unsupervised cell clustering algorithms, which compute and use the distances between cells, naturally susceptible to the curse of dimensionality [4,10]. As a result, similarities between cells may be difficult to interpret. Given that most scRNA-seq datasets measure the expression values of thousands of genes, many co-expressed genes are likely to be present, rendering cell-to-cell distances even more difficult to interpret [6]. We therefore hypothesized that pre-processing scRNA-seq data by clustering co-expressed genes together can organize the data for easier interpretation by cell clustering algorithms.
Gene clustering algorithms can be classified as supervised or unsupervised. Supervised algorithms usually aim to cluster genes with a shared functional category [7], whereas unsupervised algorithms cluster genes based on expression data without a priori information [6]. The lack of readily available functional annotations for all genes and the sensitivity of scRNA-seq data to the nuances of varying biological settings suggest that more individualized clusters are obtainable from unsupervised approaches [8], to which this paper limits its discussion.
The high dimensionality and abundant technical noise present in scRNA-seq data make gene expression patterns difficult to accurately model. Many distance metrics have been explored, including Pearson, Spearman, and Euclidean [2,6,9]. Preprocessing procedures, such as quality control, log-transformation, and normalization, have been developed to allow data to be interpreted more easily by clustering algorithms [22]. However, discovering ways to cope with technical noise is a field of ongoing exploration [10], suggesting that gene expression patterns may be too complex and noisy to be effectively modeled by strictly linear correlation (i.e., Pearson correlation) in spite of popular pre-processing methods. This obstacle motivates a distance metric that quantifies the monotonicity (i.e., the tendency for one variable to increase with another, not necessarily linearly) of the relationship between the expression values of a pair of genes. Spearman's correlation and Kendall's correlation are metrics that operate on the relative ranks of values in an effort to quantify monotonicity [14]. However, one shortcoming shared by rank-based correlations is that in the process of compressing values into their ranks, information derivable from the relative magnitudes of the values is lost.
To illustrate, consider a pair of genes from the Biase dataset [24] whose preprocessed expression values, plotted against each other, show a tight, nearly monotone relationship: the Pearson coefficient is a strong 0.998, yet although rank-based monotonicity correlations are designed to be lenient in recognizing non-linear association, the Spearman coefficient is 0.898 and the Kendall tau-b coefficient is only 0.753. For each coefficient, 1 indicates perfect association, and 0 indicates random association. Coefficients were obtained using the scipy Python package [18]. The relationship exhibits a clear tendency for the x coordinate to increase with the y coordinate, but the rank-based correlations are misled into over-magnifying the relatively minor distances among the tightly packed points.
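The effect described above can be reproduced with scipy on synthetic stand-in data (the actual Biase expression values are not reproduced here): a cloud of tightly packed points plus a few well-separated ones, with small jitter. The data below are illustrative, not taken from any dataset in this paper.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

# Synthetic stand-in: 40 tightly packed points near the origin plus
# 9 well-separated points, with small jitter added to y.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 0.05, 40), np.linspace(2.0, 10.0, 9)])
y = x + rng.normal(0.0, 0.05, x.size)

r_pearson = pearsonr(x, y)[0]
r_spearman = spearmanr(x, y)[0]
tau = kendalltau(x, y)[0]

# Pearson stays near 1, while the rank-based coefficients over-magnify
# the rank swaps among the tightly packed points.
print(f"Pearson {r_pearson:.3f}, Spearman {r_spearman:.3f}, Kendall {tau:.3f}")
```

The jitter barely moves the points geometrically, but it scrambles their ranks, which is exactly the information the rank-based coefficients operate on.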
To address the shortcomings in the distance metrics discussed above, we present a monotonicity-based distance metric that accounts for the magnitudes of the distances between points. To demonstrate the effectiveness of our distance metric when applied in a gene clustering algorithm, we compare our algorithm to four other methods (k-means, hierarchical clustering, principal component analysis, and Clust, all of which have yielded strong results in previous work [15,16,23,8]) by running the resulting clustered matrices through SC3, a cell clustering procedure [2]. Throughout, k denotes the number of cell clusters, n the number of cells analyzed, and m the number of genes after pre-processing (see Methods).

Summary of Datasets
FPKM: fragments per kilobase of transcript per million mapped reads; CPM: counts per million mapped reads; RPKM: reads per kilobase of transcript per million mapped reads; TPM: transcripts per million mapped reads.
We used each dataset as follows. We first ran SC3 on the unclustered (pre-processed) matrix to generate a benchmark cell clustering. For each of the 6 gene clustering algorithms that we tested (ours, k-means, agglomerative hierarchical clustering using Euclidean distance (HCE), agglomerative hierarchical clustering using Pearson correlation distance (HCP), principal component analysis (PCA), and Clust), we used the algorithm to produce a compressed feature matrix, which we subsequently ran through SC3.

We used the Adjusted Rand Index (ARI) to evaluate how closely the resulting cell clustering matched the one provided by the publishers of the datasets (see Methods). A higher ARI indicates a closer match: an ARI of 1 indicates exact correspondence, and an ARI of 0 indicates random correspondence. Although SC3 has been shown to produce more stable results than other popular cell clustering algorithms, its performance still fluctuates from run to run [2]. We therefore repeated each trial 10 times.

Our results are summarized in the table below. Red or green text indicates a statistically significant ARI decrease or increase, respectively (one-sided Wilcoxon signed-rank test p-value < 0.01 [11]). Bolded text indicates a median ARI that differs from the median ARI without gene clustering by 0.1 or more. CPU times were measured using the Linux time command or the system.time function in R. In the "No gene clustering" column, we recorded how long SC3 took to run on the unclustered data (the SC3 runtime is excluded from the times recorded in the other columns). Because system.time cannot accurately measure the CPU time taken by SC3, the elapsed time was recorded instead (with the number of CPU cores used in SC3 set to 1) in that column. See the Methods section for details on how these algorithms were run.
To visualize the effects that our algorithm has on cell clustering, we plot the heatmaps produced by SC3 before and after our clustering, at their respective medians (5th highest ARI), for the Tatsuoka and Treutlein datasets. The SC3 workflow consolidates the clustering results from many combinations of clustering parameters to generate an n × n consensus matrix, visualized as a heatmap. The colors in the heatmap range from dark blue, indicating high confidence that the two cells belong in different clusters, to dark red, indicating high confidence that the two cells belong in the same cluster. The n × n matrix is divided into a k × k block matrix, with the generated clusters occurring along the descending diagonal. In the ideal clustering, the k blocks along the descending diagonal would be completely red and everything else would be completely blue. The cell labels reported by the publishers of the data are color coded above the heatmaps.
On the unclustered Tatsuoka dataset, 7 out of the 24 cells are misassigned.
On our compressed feature matrix, only 3 cells are misassigned.
On the unclustered Treutlein dataset, the E18.5 cells are split into 2 approximately equally sized clusters.
With our compressed feature matrix, the cell relationships among the E18.5 cells are clarified, leading nearly all the E18.5 cells to be clustered together.

Discussion
Identifying cell subpopulations is an important application of scRNA-seq data and has led to notable developments such as the Human Cell Atlas in recent years [6,21]. We use the ARI to quantitatively compare the accuracy of cell clusterings. The gene clustering algorithms all had minimal effects on the ARI in most of the datasets. In some cases, a gene clustering algorithm considerably decreased the ARI (HCP in the Biase dataset and PCA in the Pollen dataset), suggesting that scRNA-seq data can be sensitive to feature extraction. However, for both of these datasets, the other clustering algorithms were able to maintain the ARI, indicating that feature extraction with minimal information loss was achievable. An effective clustering algorithm must therefore be able to identify the extent to which clustering should be performed: recognizing when meaningful clusters are rare is important for avoiding information loss in cases where feature extraction would disrupt data interpretability.

Among the algorithms we tested, our algorithm, PCA, and Clust have adjustable parameters but are still able to produce strong results using default parameters and rules of thumb, without extensive intervention. In fact, Clust's consensus-based workflow renders it arguably too reluctant to cluster genes together, which led Clust to hardly reduce, if at all, the number of features in each dataset. This reveals a valuable advantage of our clustering algorithm (as well as PCA and Clust) over k-means and hierarchical clustering: our algorithm does not rely on a pre-defined number of clusters, making it more applicable to the abundant cases in high-dimensional space where the number of clusters is not known beforehand. For example, our algorithm more than halved the number of features in the Tatsuoka dataset but reduced the number of features in the Treutlein dataset by less than 10%.
In practice, k-means and hierarchical clustering would not have had our algorithm available to supply a baseline number of clusters to generate, so our results relative to theirs are more pessimistic than they would be in reality.
Although our algorithm minimally affected most of the datasets, its drastic improvements in the Tatsuoka and Treutlein datasets (mean and median ARI increased by more than 0.1) demonstrate its substantial ability to improve clarity in scRNA-seq data for cell clustering. Our results show that other clustering algorithms are also capable of clarifying cell-to-cell relationships through gene clustering (k-means and HCE in the Tatsuoka dataset, and PCA in the Shen and Treutlein datasets). Therefore, we present our distance metric and clustering algorithm as a strong option to be explored by future researchers, potentially in a consensus clustering approach. Furthermore, as the algorithms we present are not exclusively applicable to scRNA-seq data, we believe our algorithms can find applications in a wider variety of fields.

Limitations
In this study, although we chose the clustering algorithms we considered most relevant to scRNA-seq data, we could test only a limited breadth of clustering techniques. In addition, our algorithm comes at a high computational cost compared to SC3 and the other clustering algorithms we compared (see Methods for the derivation of the time complexity). A limitation of our algorithm is therefore its lack of scalability to large datasets, especially as the number of genes grows large. A large factor behind the disparity between the runtimes of our algorithm and SC3 is that the number of genes greatly exceeds the number of cells in the datasets we tested. As the number of cells approaches the number of genes, however, this disparity disappears and is even reversed. For example, on synthetically generated data with 500 cells and 500 genes, SC3 ran in 2 minutes and 9 seconds (with k = 5), while our algorithm ran in less than a second. The bulk of our computation comes from computing the monotonicity values, which can be computed independently, allowing efficient use of parallel processing. As a result, our wall-clock runtimes were 4 to 5 times shorter than the CPU times listed in Table 1.

Pre-processing
We filter out rare and ubiquitous genes (genes expressed in fewer than 10% or more than 90% of the cells) with the methods sc3_prepare and get_processed_dataset from the SC3 R package, since such genes are likely to have little informative value [2]. The filtered matrix is then log-transformed and normalized using the logNormCounts method from the scater R package [13,22]. Only the genes whose expression values were recorded across all cell subpopulations (albeit possibly as dropout values) were analyzed, because genes whose expression values were recorded in only some cells are likely to differentiate the cell subpopulations too conveniently. This condition was not enforced for the Yan dataset, in which all of the genes whose expression values were recorded across all cells were ubiquitous genes (presumably because expression values of 0 were preemptively removed). The data in the Patel dataset were already log-transformed, but without adding 1 to the TPM value, so each data value x was replaced with log2(2^x + 1).
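The pipeline above is R-based (SC3 and scater); the sketch below mirrors its three steps in Python as a rough illustration only. The function names are ours, and the exact filters and size factors used by sc3_prepare and logNormCounts differ from this simplified version.

```python
import numpy as np

def filter_genes(expr, lo=0.10, hi=0.90):
    """Drop genes detected in fewer than `lo` or more than `hi` of cells.

    expr: genes x cells matrix (0 = dropout). An illustrative stand-in
    for the rare/ubiquitous-gene filter applied via sc3_prepare.
    """
    detected = (expr > 0).mean(axis=1)  # fraction of cells expressing each gene
    return expr[(detected >= lo) & (detected <= hi)]

def log_normalize(expr):
    """Library-size normalize, then take log2(x + 1).

    A rough stand-in for scater's logNormCounts; the size-factor
    computation there is more involved.
    """
    size = expr.sum(axis=0)
    size_factors = size / np.median(size)
    return np.log2(expr / size_factors + 1.0)

def undo_patel_transform(expr_log):
    """Re-transform values logged without the +1 pseudocount:
    x -> log2(2**x + 1), as done for the Patel dataset."""
    return np.log2(2.0 ** expr_log + 1.0)
```

For the Patel re-transformation, note that a recorded value of 0 (raw TPM of 1) maps to log2(2) = 1, restoring consistency with the log2(TPM + 1) convention used for the other datasets.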

Our Algorithm
We define our distance metric as

monotonicity(x, y) = [Σ over positively monotonic pairs (i, j) of |x_i − x_j| · |y_i − y_j|] / [Σ over negatively monotonic pairs (i, j) of |x_i − x_j| · |y_i − y_j|],

where x and y represent the expression values of 2 genes over n cells. Our monotonicity value relates the positive monotonic behavior to the negative monotonic behavior. A pair of points contributes to the numerator if it exhibits positive monotonic behavior ((x_i − x_j)(y_i − y_j) > 0), whereas it contributes to the denominator if it exhibits negative monotonic behavior ((x_i − x_j)(y_i − y_j) < 0). In either case, its contribution is the area of the rectangle whose opposite corners are those 2 points.
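A direct O(n^2) reference implementation of this definition, with variable names of our choosing (the handling of a zero denominator as infinity is our code-level convention, not specified in the text):

```python
import numpy as np

def monotonicity_naive(x, y):
    """Monotonicity value of two expression vectors x, y over n cells.

    Each pair of points contributes the area |dx| * |dy| of the rectangle
    it spans: to the numerator if the pair is positively monotonic
    (dx and dy share a sign), to the denominator if negatively monotonic.
    Pairs tied in either coordinate span a zero-area rectangle and
    contribute nothing. O(n^2) reference version.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    num = den = 0.0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[j] - x[i], y[j] - y[i]
            area = abs(dx) * abs(dy)
            if dx * dy > 0:
                num += area
            elif dx * dy < 0:
                den += area
    # Convention for this sketch: perfectly positive monotone data -> inf.
    return num / den if den > 0 else float("inf")
```

Scaling either vector by a constant multiplies the numerator and denominator by the same factor, so the value is scale-invariant, as noted in the Motivation section.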
For a pair of genes, naively computing our metric would require O(n^2) time, so here we outline how we reduced the runtime to O(n log n), matching the Spearman correlation time complexity, albeit with a higher constant factor. Assume the points are sorted by their x values. Focusing on the numerator for now, for every point j, we wish to compute

Σ_{i < j, y_i < y_j} (x_j − x_i)(y_j − y_i).

Distributing the product (and using the same summation condition), we get

x_j y_j · Σ 1 − x_j · Σ y_i − y_j · Σ x_i + Σ x_i y_i.

Similarly, in the denominator, for every point j, we wish to compute

Σ_{i < j, y_i > y_j} (x_j − x_i)(y_i − y_j).

The Binary Indexed Tree (BIT) is a linear-memory data structure that offers two functions, each executed in logarithmic time: (1) to change every value in a specified prefix of an array by a specified amount and (2) to query the sum of the values in a specified prefix of the array [17]. We update and query a BIT for each of the 8 summations (4 per point for the numerator and 4 for the denominator) as we sweep through the n points from left to right, for an overall runtime of O(n log n). Two genes with a high monotonicity value are considered to be co-expressed. It has been shown that a negative association between 2 genes' expression values can generate meaningful results regarding gene regulation [12]. However, as this paper limits itself to the application of this metric to clustering genes for subsequent cell clustering, we choose only to cluster together genes that are positively monotonic with each other. From here on, "monotonically expressed" will imply "positively monotonically expressed".
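The sweep can be sketched as follows. This is an illustrative re-implementation, not the authors' code: it keys the BITs by the rank of y and uses the equivalent point-update/prefix-sum form of the Fenwick tree rather than the prefix-update variant described above, which suffices for these queries.

```python
import numpy as np

class BIT:
    """Binary Indexed (Fenwick) tree: point update and prefix-sum query,
    both in O(log n) time and O(n) memory."""
    def __init__(self, n):
        self.t = [0.0] * (n + 1)
    def update(self, i, v):          # add v at 1-based index i
        while i < len(self.t):
            self.t[i] += v
            i += i & -i
    def query(self, i):              # sum of values at indices 1..i
        s = 0.0
        while i > 0:
            s += self.t[i]
            i -= i & -i
        return s

def monotonicity_fast(x, y):
    """O(n log n) monotonicity value via a left-to-right sweep.

    After sorting by x, each point j needs, over earlier points i with
    y_i < y_j (numerator) or y_i > y_j (denominator), the four sums
    count, sum x_i, sum y_i, sum x_i*y_i -- the terms of the distributed
    product (x_j - x_i)(y_j - y_i). Each sum lives in a BIT keyed by the
    rank of y, giving the 8 running summations.
    """
    order = np.argsort(x, kind="stable")
    x, y = np.asarray(x, float)[order], np.asarray(y, float)[order]
    n = len(x)
    ranks = {v: r + 1 for r, v in enumerate(sorted(set(y)))}
    m = len(ranks)
    cnt, sx, sy, sxy = BIT(m), BIT(m), BIT(m), BIT(m)
    tot_c = tot_x = tot_y = tot_xy = 0.0
    num = den = 0.0
    for j in range(n):
        r = ranks[y[j]]
        # Numerator: earlier points with strictly smaller y (ranks 1..r-1).
        c, ax, ay, axy = cnt.query(r - 1), sx.query(r - 1), sy.query(r - 1), sxy.query(r - 1)
        num += x[j] * y[j] * c - y[j] * ax - x[j] * ay + axy
        # Denominator: earlier points with strictly larger y (ranks > r),
        # obtained as totals minus a prefix; the sign flips for (y_i - y_j).
        c2, ax2 = tot_c - cnt.query(r), tot_x - sx.query(r)
        ay2, axy2 = tot_y - sy.query(r), tot_xy - sxy.query(r)
        den += -(x[j] * y[j] * c2 - y[j] * ax2 - x[j] * ay2 + axy2)
        cnt.update(r, 1.0); sx.update(r, x[j]); sy.update(r, y[j]); sxy.update(r, x[j] * y[j])
        tot_c += 1; tot_x += x[j]; tot_y += y[j]; tot_xy += x[j] * y[j]
    # Convention for this sketch: perfectly positive monotone data -> inf.
    return num / den if den > 0 else float("inf")
```

Ties in x or y contribute zero-area rectangles, and the distributed-sum algebra handles them exactly, so no special casing is needed beyond the strict rank comparisons.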
Because our monotonicity values quantify monotonic association and are not bounded, the triangle inequality is not necessarily satisfied. In other words, even if genes x and y are monotonically expressed and genes y and z are monotonically expressed, genes x and z might not be monotonically expressed. Therefore, using a conventional clustering algorithm such as k-means or hierarchical clustering with our monotonicity values would generate noisy clusters. Instead, we apply a stricter clustering algorithm which ensures that within a cluster, a randomly selected pair of genes has probability greater than tightness of being monotonically expressed, for some specified probability tightness (default 0.95). A pair of genes is considered monotonically expressed if its monotonicity value exceeds a specified value threshold (default 15). The time complexity of computing the monotonicity value for every pair of genes is O(m^2 n log n), which accounts for the bulk of the computational cost. As the monotonicity values can be computed independently, parallel processing can be used to compute them more efficiently. Once the monotonicity values have been computed, the clustering algorithm can be performed in O(m^2 log m) time, potentially multiple times with various parameters. The overall time complexity is O(m^2 (n log n + log m)).
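One greedy reading of this tightness constraint can be sketched as below. This is only our illustrative interpretation of the procedure described in the text, not the authors' exact algorithm; `monotonicity_of` is a placeholder for the pairwise metric, and the variance-ordered iteration anticipates the rationale given in the Motivation section.

```python
import numpy as np

def cluster_genes(expr, monotonicity_of, threshold=15.0, tightness=0.95):
    """Greedy sketch of tightness-constrained gene clustering.

    expr: genes x cells matrix. Genes are visited in order of decreasing
    variance; a gene joins the first cluster in which the fraction of
    monotonically expressed pairs (monotonicity > threshold) would stay
    above `tightness`, and otherwise starts a new cluster.
    """
    order = np.argsort(-expr.var(axis=1))
    clusters = []  # each entry: [member gene indices, #monotonic pairs, #pairs]
    for g in order:
        placed = False
        for c in clusters:
            members, good, pairs = c
            hits = sum(monotonicity_of(expr[g], expr[h]) > threshold
                       for h in members)
            new_pairs = pairs + len(members)
            if (good + hits) / new_pairs > tightness:
                members.append(g)
                c[1], c[2] = good + hits, new_pairs
                placed = True
                break
        if not placed:
            clusters.append([[g], 0, 0])
    return [c[0] for c in clusters]
```

Because every within-cluster pair is checked against the threshold, the tightness guarantee holds by construction for the pairs counted, at the cost of the O(m^2) pairwise comparisons noted above.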

Comparison of Gene Clustering Algorithms
To test the various clustering algorithms, each clustering algorithm takes in the pre-processed expression matrix and feeds a compressed feature matrix into SC3 for cell clustering. We chose SC3 because its stability makes it well-suited for testing compared to other popular cell clustering algorithms [2]. The Adjusted Rand Index (ARI) was used to evaluate how closely the resulting cell clustering matched the one provided by the original authors of the datasets [19]. To calculate the ARI between 2 clusterings of n objects, an r × s contingency table N is constructed (where the first and second clusterings generated r and s clusters, respectively) in which N_ij is the number of objects clustered into cluster i by the first clustering and into cluster j by the second clustering. a_i denotes Σ_j N_ij, and b_j denotes Σ_i N_ij. For each dataset, SC3 was run on the unclustered expression matrix as well as the compressed feature matrix produced by each of the 6 gene clustering algorithms we tested. Our algorithm was run with its default parameters (a monotonicity threshold of 15 and a tightness value of 0.95). Clust was run with its default parameters (a tightness value of 1) and with the -n 0 flag to skip normalization, because normalization was performed in the pre-processing. K-means and hierarchical clustering require the number of clusters to be selected by the user. Several methods have been developed for determining the optimal number of clusters for such algorithms, including the elbow method, the average silhouette method, and the gap statistic method [20]. However, such methods often involve generating the clusterings for every number of clusters within a given range, which is infeasible given the high dimensionality of scRNA-seq data. Therefore, to fairly compare our algorithm with k-means and hierarchical clustering, the latter 2 algorithms were specified to generate the same number of clusters as generated by our algorithm.
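From the contingency table N with row sums a_i and column sums b_j defined above, the ARI follows the standard Hubert-Arabie form. A small self-contained sketch (the function name is ours):

```python
import numpy as np
from math import comb

def adjusted_rand_index(u, v):
    """ARI of two clusterings u, v (label sequences over the same objects).

    Builds the r x s contingency table N, where N[i][j] counts objects
    placed in cluster i by u and cluster j by v; a_i and b_j are the row
    and column sums of N.
    """
    us, vs = sorted(set(u)), sorted(set(v))
    N = np.array([[sum(1 for p, q in zip(u, v) if p == i and q == j)
                   for j in vs] for i in us])
    a, b, n = N.sum(axis=1), N.sum(axis=0), N.sum()
    sum_ij = sum(comb(int(c), 2) for c in N.ravel())  # agreeing pairs
    sum_a = sum(comb(int(c), 2) for c in a)
    sum_b = sum(comb(int(c), 2) for c in b)
    expected = sum_a * sum_b / comb(int(n), 2)        # chance agreement
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)
```

Identical clusterings (up to label permutation) score 1, and the chance correction centers random correspondence near 0; in practice a library routine such as scikit-learn's adjusted_rand_score computes the same quantity.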
The scipy Python package was used to run agglomerative hierarchical clustering (with Euclidean distance and Pearson correlation distance) [18], and the scikit-learn Python package was used to run k-means clustering [5]. The Seurat R package was used to perform principal component analysis (PCA) to extract high-variance eigenvectors as cell features [23]. As pre-processing was performed beforehand, the ScaleData function was run with do.scale = FALSE. For most of the datasets we tested, the number of cells was much smaller than the number of genes, so we used one less than the number of cells as the number of principal components. In datasets with large numbers of cells (Bozec and Shen), we used 500 principal components. For PCA, the matrix of cell embeddings was used as the compressed feature matrix. For the other gene clustering algorithms, the vectors of gene expression within each gene cluster were averaged to produce the compressed feature matrix. For any parameters not mentioned above, default values were used. An Inspiron 15 7000 laptop with an 8th generation i7 processor was used for benchmarking times.
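The averaging step that turns a gene clustering into a compressed feature matrix is a one-liner; the sketch below (function name ours) shows the intended shape: one averaged expression vector per gene cluster, over the same cells.

```python
import numpy as np

def compress(expr, clusters):
    """Average the gene expression vectors within each gene cluster.

    expr: genes x cells matrix; clusters: list of lists of gene indices.
    Returns a (#clusters x cells) compressed feature matrix, the form
    fed to SC3 for the non-PCA gene clustering algorithms.
    """
    return np.vstack([expr[list(c)].mean(axis=0) for c in clusters])
```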

Motivation
Our expression for monotonicity is motivated by the desire for a gradual transition between positive monotonic behavior and negative monotonic behavior with regard to contribution to the numerator or denominator. In other words, suppose that Euclidean distance or Manhattan distance were used instead. If x_i ≈ x_j and y_j ≫ y_i, a slight shift from x_i < x_j to x_i > x_j would move a large contribution from the numerator to the denominator, or vice versa. On the other hand, our metric gives only minor weight to 2 points that deviate significantly in one dimension but insignificantly in the other, so a shift from x_i < x_j to x_i > x_j allows the numerator contribution to gradually reach 0 before the denominator contribution gradually increases from 0. Meanwhile, if the points differ significantly in both dimensions, we are more confident in their positive or negative monotonic behavior, so we grant them more contribution to the numerator or denominator, respectively. With randomly generated data, the expected monotonicity value is 1. It is worth noting that although the magnitudes of the values are factored into our metric, arbitrarily adjusting the scale of the values would impact the numerator and denominator by the same factor, making our distance metric scale-invariant, as are many commonly used metrics.
Our metric's ability to take into account the magnitudes of values comes at the cost of allowing outliers to potentially sway the monotonicity. As a result, pre-processing methods are necessary to transform the distribution of the data for more balanced monotonicity values. When clustering genes based on a monotonicity-based metric, uniformly expressed genes (with either very scarce or very abundant dropout values) are dangerous because they are likely to have high monotonicity values with many genes even when no meaningful relationship exists. The pre-processing performed by the aforementioned SC3 methods is therefore of critical importance here in filtering out such genes. Additionally, in the clustering process, we iterate through the genes in order of decreasing variance for two reasons. (1) Even among the genes that pass the pre-processing stage, some relatively uniformly expressed genes can be expected. When these genes are involved, the monotonicity value is volatile because both the numerator and denominator are small. Considering uniformly expressed (low-variance) genes at the end prevents them from swaying the average correlation values; and since the ultimate clustered expression values are obtained by averaging the gene vectors of expression values within each cluster, uniformly expressed genes have little impact on the clustered expression values because they exert relatively uniform influence over the values of all cells. (2) Due to the high dimensionality of scRNA-seq data, genes with high variances, which tend to serve as the cluster "centroids", tend to be well-separated.

Sample Data
The graphs below plot pairs of pre-processed gene expression data from the Biase dataset [24], featuring 49 cells, represented as points in the graphs. These sample data help illustrate how our monotonicity metric more effectively quantifies the degree of co-expression between genes compared to commonly used metrics. Here, we focus on the Pearson, Spearman, and Kendall tau-b correlations because they are scale-invariant and deterministic, allowing their performance to be independent of the number of data points and how the data are scaled [14].