Abstract
Clustering is routinely applied to modern high-dimensional data, including gene expression measurements from microarray and RNA-seq. Iteratively estimating the cluster centers and assigning memberships according to pre-defined criteria, the clustering algorithms classify genes or samples to help ascertain molecular processes or sub-types. For example, the cluster membership assignments of unlabeled single cells from massively parallel RNA-seq experiments are used as the cell identities. However, how can we evaluate if the cluster memberships are correctly assigned? To this end, we introduce the jackstraw methods for unsupervised classifications that rigorously test the assignments of data features into their clusters. By learning uncertainty in clustering the noisy data, the proposed jackstraw methods can identify statistically significant features that truly make up the corresponding clusters. Simulation studies using K-means clustering confirm the accuracy of the proposed statistical significance. We consider mRNA abundances of 5981 Saccharomyces cerevisiae genes under cell cycle. After the proposed jackstraw methods are applied for K = 6 clusters, we estimate and use posterior inclusion probabilities (PIP) to select and visualize the canonical features for their clusters. We also investigate the single cell RNA-seq (scRNA-seq) data from a mixture of Jurkat and 293T cell lines, where individual cell identities are unknown. The jackstraw methods evaluate cluster membership assignments of 3381 unlabeled single cells such that the majority of multiplets are identified in an unsupervised manner. When clustering is employed in high-dimensional data analysis, the proposed tests enable rigorous evaluation of membership assignments that readily improve feature selection and visualization.
Software: The jackstraw package in R is available at https://github.com/ncchung/jackstraw.
Abbreviations
- scRNA-seq: single cell RNA sequencing
- PCA: principal component analysis
- PIP: posterior inclusion probability
- FDR: false discovery rate
Introduction
High-throughput technologies have enabled large-scale measurements of DNA, RNA, metabolites, and others. Recent technological and experimental advancements, such as single cell RNA-seq [1, 2] and mass-spectrometry [3, 4], have resulted in increasing challenges and opportunities for using unlabeled data and unsupervised learning. For example, a single cell RNA-seq (scRNA-seq) technology enables gene expression measurements of thousands of blood cells in order to elucidate molecular subtypes. Unsupervised assignments of unlabeled single cells to K clusters according to their gene expression profiles provide cluster-based cell identities. Despite diverse clustering techniques available, it has not been possible to re-use data-driven clusters and test their membership in downstream statistical analyses without incurring artificially inflated significance. We have developed a novel and general method to statistically test the assignment of a data feature to a particular cluster, aiding in feature selection, dimension reduction, and visualization.
Clustering has been one of the most popular analysis methods for high-dimensional genomic data. In the absence of external and accurate labels, clustering can identify and approximate co-regulated subsets of genomic variables (e.g., genes, loci) or subtypes of related observations (e.g., patients, single cells). For example, a conventional microarray study measures gene expression of samples from either control or disease groups. Since the molecular functions of genes might not be known, the unsupervised classification of genomic variables can help identify co-varying subsets that form molecular processes [5–7]. These clusters and membership assignments of genes have been extensively used in the visualization of systematic patterns and outliers [8, 9]. Recently, there have been many studies where mRNA abundances from thousands of single cells are measured en masse using scRNA-seq [10–12]. Then, gene expression profiles of unlabeled single cells are clustered to obtain cell identities. These cell identities, which may be related to subtypes, lineage, or other molecular factors, are often used in downstream differential expression and other analyses. Note that a data feature refers to either a variable or an observation, since clustering can be applied on either dimension.
After automatically assigning observed features to K clusters that are summarized by K centers, we are interested in testing the membership assignments of individual features. This will improve the data-driven cell identities in scRNA-seq experiments, as well as the clustering of genomic variables to help elucidate molecular processes. To this end, we have developed an innovative data resampling and testing scheme for unsupervised classification that rigorously evaluates whether observed features are truly members of their corresponding clusters. By estimating p-values and posterior inclusion probabilities (PIPs), the proposed methods can identify and visualize features that have been accurately and reliably assigned to the clusters. This bridges the direct estimation of latent variables from large-scale data and a fundamental hypothesis testing framework, which readily provides p-values, false discovery rates, and posterior probabilities crucial for data exploration and inference.
By utilizing a newly developed resampling technique called the jackstraw [13], the proposed methods learn the overfitting inherent in using cluster centers that are estimated from the data. In other words, the proposed methods enable accurate statistical testing of cluster membership while taking into account uncertainty in the clustering algorithms. Simulation studies demonstrate accurate and favorable operating characteristics. The joint behavior of p-values is scrutinized by conducting 100 independent simulations that satisfy the joint null criterion [14]. Two applications are presented using two different dimensions (genomic variables and samples) available for unsupervised classification. Yeast cell cycle microarray data [8] are used to cluster 5981 genes into K = 6 clusters, whose statistically significant members are identified. We also consider the scRNA-seq data from a mixture of two different cell lines [12]. By applying the proposed methods on unlabeled single cells, we show improved classification and visualization of cell identities. These proposed methods are implemented in an R package called jackstraw (https://github.com/ncchung/jackstraw).
Methods and Algorithms
The observed data $Y_{m \times n}$ contains m rows and n columns. Because either a set of variables (e.g., genes) or a set of observations (e.g., cells) may be clustered, we refer to the m rows as m observed features for simplicity (see footnote 1). Then, $m_1, \ldots, m_K$ features are assigned to the corresponding clusters $1, \ldots, K$, where $\sum_{k=1}^{K} m_k = m$. The center $c_k$ for $k = 1, \ldots, K$ summarizes the kth cluster. For example, in K-means clustering, the nearest means are used to assign observed features to the clusters. If $y_i$ is assigned to the kth cluster with center $c_k$, its membership indicator $\beta_{i,k}$ is 1. By definition, the subset of features $y_i$ with $\beta_{i,k} = 1$ makes up $c_k$.
Cluster centers and membership assignments may be viewed as approximating latent variables L and membership indicators B (i.e., dichotomous coefficients). Latent variables $l_k$ for $k = 1, \ldots, K$ may assume a wide range of patterns, including continuous or categorical structures [15,16]. Clustering algorithms simultaneously identify the data features that contribute to the estimates of $l_k$:
$$y_i = \sum_{k=1}^{K} b_{i,k}\, l_k + e_i, \qquad i = 1, \ldots, m.$$
If a particular ith feature is truly associated with the kth latent variable, its coefficient bi, k is 1; otherwise, it is 0. Feature-specific noise ei is assumed to be independently and identically distributed. Row-wise means are handled by centering the data, whereas row-wise variances are preserved by our proposed resampling scheme.
There have been important developments in clustering that consider mixture or latent variable models, which improve our understanding and interpretation of data [17–19]. However, even model-based clustering approaches or regularization do not provide cluster centers and membership assignments that can be tested again against the observed features without so-called “double dipping.” Our proposed approach learns and incorporates the inevitable uncertainty in assigning features to clusters that are directly derived from the same set of features. This mirrors the jackstraw test when latent variables are estimated using principal component analysis (PCA) [13]. Our statistical significance approach using the jackstraw strategy is related to [20,21], as well as Bayesian p-values [22,23]. Furthermore, regularized methods are available for clustering, such that sparsity can be induced [24,25].
Jackstraw Data and Strategy
We apply the jackstraw strategy to clustering unlabeled features of observed data Y. Generally, we would like to create a relatively small number s (≪ m or n) of synthetic null features without disturbing the overall patterns of systematic variation. The jackstraw data Y* refers to this revised data, where m – s observed features are intact and s synthetic null features have been resampled with replacement (Figure 1). Applying the clustering algorithm on Y* produces cluster centers that are almost identical to the original cluster centers ck (for k =1,…, K).
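The construction of the jackstraw data can be sketched in a few lines of Python with NumPy. This is a minimal sketch under stated assumptions rather than the package's implementation: the function name `make_jackstraw_data` is hypothetical, and it assumes that each synthetic null is formed by resampling the n entries of a randomly chosen row with replacement, which breaks any association with the cluster centers while reusing only that row's own values.

```python
import numpy as np

def make_jackstraw_data(Y, s, rng):
    """Return (Y_star, null_rows): a copy of Y in which s rows are synthetic nulls.

    Assumption: a synthetic null is a row whose n entries are resampled
    with replacement from that same row.
    """
    m, n = Y.shape
    null_rows = rng.choice(m, size=s, replace=False)  # features turned into nulls
    Y_star = Y.copy()                                 # the other m - s rows stay intact
    for i in null_rows:
        Y_star[i] = rng.choice(Y[i], size=n, replace=True)
    return Y_star, null_rows

# Toy usage: 50 features over 20 dimensions, 5 synthetic nulls.
rng = np.random.default_rng(0)
Y = rng.normal(size=(50, 20))
Y_star, null_rows = make_jackstraw_data(Y, s=5, rng=rng)
```

Because s is small relative to m, clustering `Y_star` should give centers close to those obtained from `Y`, which is the property the jackstraw strategy relies on.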
Because of the nature of clustering algorithms, all features in Y*, including the s synthetic nulls, will be assigned to one of the K clusters. When a synthetic null feature $y_i^*$ is assigned to the kth cluster, an association statistic between $y_i^*$ and $c_k^*$ is under the null model that assumes independence, since $y_i^*$ is i.i.d. by definition. Yet, because $y_i^*$ does indeed contribute to $c_k^*$, we effectively learn the overfitting characteristics of the clustering algorithms. Over a large number of iterations b = 1,…, B, we can form the empirical distribution of null statistics as in Algorithm 1.
Feature-level evaluation of cluster membership requires a pre-defined number of clusters K. There is a vast amount of literature on the choice of K, which is beyond the scope of this study. In practice, a data analyst must explore the observed data, often utilizing prior knowledge, visualization, and heuristics. Methods have been proposed in the last five decades in this area of research, including cluster stability or reliability statistics [26–34]. We recognize that data normalization, cluster stability, and other preclassification steps are essential to sensible unsupervised learning. Through re-analysis of microarray and scRNA-seq data, we showcase the jackstraw tests in the context of broader unsupervised learning pipelines.
Algorithm 1: Jackstraw Strategy for Unsupervised Classification
1. Apply the clustering algorithm to Y, to obtain C and β
2. Compute the observed statistics, relating Y and C
3. Create Y* with a small number of synthetic null features y*
4. Apply the clustering algorithm to Y*, to obtain C* and β*
5. Compute the null statistics, relating y* and C*
6. Repeat the above three steps to form an empirical distribution of null statistics
There are idiosyncratic outcomes of clustering that require our attention. Some clustering algorithms may generate an empty cluster or a singleton (a cluster with one feature). An empty cluster can be ignored in our methods as it does not contain any observed feature as a member. We consider the only feature of a singleton as its true member. It is possible that synthetic null features are rarely clustered into a certain cluster, such that there is a limited amount of empirical null statistics for that cluster. This likely occurs when that cluster is substantially smaller than others or has very distinct centers such that its members are tightly (and accurately) grouped in n dimensions. An increase of B would alleviate this, in tandem with examining the overall p-value distribution.
Jackstraw Tests for K-means Clustering
We now present a detailed algorithm using K-means clustering [35–37]. K-means clustering is one of the most popular and well-studied algorithms and has been applied to a wide range of genomic studies. In Algorithm 2, we use F-statistics where the full models include the appropriate cluster centers. The use of F-statistics allows us to flexibly specify the full and null models, which may incorporate other covariates in more complex settings.
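To make the role of the full and null models concrete, the following Python sketch computes an F-statistic for a single feature against its cluster center by comparing two nested least-squares fits. The function name `f_statistic` and the optional `covariates` argument are hypothetical; the null model is assumed here to contain an intercept plus any covariates, and the full model adds the cluster center.

```python
import numpy as np

def f_statistic(y, center, covariates=None):
    """F-statistic for one feature y (length n) against a cluster center (length n).

    Null model: intercept (+ optional covariates). Full model: null model + center.
    `covariates`, if given, is one vector of length n or an array of shape (p, n).
    """
    n = y.shape[0]
    null_cols = [np.ones(n)]
    if covariates is not None:
        null_cols.extend(np.atleast_2d(covariates))
    X_null = np.column_stack(null_cols)
    X_full = np.column_stack(null_cols + [center])

    def rss(X):
        coef = np.linalg.lstsq(X, y, rcond=None)[0]
        return np.sum((y - X @ coef) ** 2)

    rss_null, rss_full = rss(X_null), rss(X_full)
    df_num = X_full.shape[1] - X_null.shape[1]   # extra parameters in the full model
    df_den = n - X_full.shape[1]                 # residual degrees of freedom
    return ((rss_null - rss_full) / df_num) / (rss_full / df_den)

# Toy usage: a feature that follows the center versus a pure-noise feature.
rng = np.random.default_rng(0)
n = 100
center = rng.normal(size=n)
y_member = center + rng.normal(size=n)
y_null = rng.normal(size=n)
f_member = f_statistic(y_member, center)
f_null = f_statistic(y_null, center)
f_adjusted = f_statistic(y_member, center, covariates=rng.normal(size=n))
```

Without covariates this reduces to the F-statistic of a simple linear regression, so it is a monotone function of the squared correlation between the feature and its center.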
Algorithm 2: Jackstraw Test for Membership Assignments in K-means Clustering
1. Apply K-means clustering to the observed data Y, resulting in cluster centers $c_k$ for $k = 1, \ldots, K$ and membership assignments $b_{i,k}$ for $i = 1, \ldots, m$ and $k = 1, \ldots, K$
2. Compute the observed statistics $F_1, \ldots, F_m$, where the full models include the corresponding cluster centers $c_k$
3. Create s synthetic nulls by resampling a small proportion of features (s ≪ m) with replacement, resulting in the jackstraw data Y*, with m – s observed features and s synthetic features
4. Apply the clustering algorithm to the jackstraw data Y*, resulting in cluster centers $c_k^*$ and membership assignments $b_{i,k}^*$
5. Compute the null statistics $F_1^*, \ldots, F_s^*$, where the full models include the corresponding cluster centers $c_k^*$
6. Repeat the above three steps b = 1,…, B times to obtain a total of s × B null statistics
7. Compute the p-values by empirically ranking the observed statistics among the null statistics, stratified by cluster assignments
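The steps of the jackstraw test for K-means clustering can be sketched end to end in Python with NumPy. This is an illustrative sketch, not the implementation in the jackstraw R package. All function names are hypothetical, the K-means routine is a plain Lloyd iteration, the F-statistic uses an intercept-only null model, each synthetic null is assumed to be a row resampled with replacement along its n entries, and the "+1" in the empirical p-value is one common convention chosen here as an assumption.

```python
import numpy as np

def kmeans(Y, init_centers, n_iter=25):
    """Plain Lloyd iterations from given initial centers; returns (centers, labels)."""
    centers = np.array(init_centers, dtype=float)
    for _ in range(n_iter):
        labels = ((Y[:, None, :] - centers[None]) ** 2).sum(axis=2).argmin(axis=1)
        for k in range(len(centers)):
            if np.any(labels == k):          # an empty cluster keeps its old center
                centers[k] = Y[labels == k].mean(axis=0)
    labels = ((Y[:, None, :] - centers[None]) ** 2).sum(axis=2).argmin(axis=1)
    return centers, labels

def f_stats(Y, centers, labels):
    """F-statistic of each row against its assigned center (intercept-only null)."""
    C = centers[labels]
    Yc = Y - Y.mean(axis=1, keepdims=True)
    Cc = C - C.mean(axis=1, keepdims=True)
    r2 = (Yc * Cc).sum(axis=1) ** 2 / ((Yc ** 2).sum(axis=1) * (Cc ** 2).sum(axis=1))
    return r2 * (Y.shape[1] - 2) / (1.0 - r2)

def jackstraw_kmeans(Y, K, s, B, rng):
    m, n = Y.shape
    # Steps 1-2: cluster the observed data and compute observed statistics.
    centers, labels = kmeans(Y, Y[rng.choice(m, size=K, replace=False)])
    f_obs = f_stats(Y, centers, labels)
    null_f = [[] for _ in range(K)]
    for _ in range(B):
        # Step 3: replace s rows with synthetic nulls.
        rows = rng.choice(m, size=s, replace=False)
        Y_star = Y.copy()
        Y_star[rows] = np.array([rng.choice(Y[i], size=n, replace=True) for i in rows])
        # Step 4: re-cluster, starting from the original centers.
        c_star, l_star = kmeans(Y_star, centers)
        # Step 5: null statistics for the synthetic nulls only, kept per cluster.
        f_null = f_stats(Y_star[rows], c_star, l_star[rows])
        for k in range(K):
            null_f[k].extend(f_null[l_star[rows] == k])
    # Step 7: empirical p-values, stratified by cluster assignment.
    pvals = np.full(m, np.nan)               # stays NaN if a cluster received no nulls
    for k in range(K):
        null_k = np.sort(np.asarray(null_f[k]))
        members = labels == k
        if null_k.size:
            n_ge = null_k.size - np.searchsorted(null_k, f_obs[members], side="left")
            pvals[members] = (n_ge + 1) / (null_k.size + 1)
    return labels, pvals

# Toy usage: 100 signal rows built on one latent variable and 100 noise rows.
rng = np.random.default_rng(1)
n = 40
l1 = rng.normal(size=n)
Y = rng.normal(scale=2.0, size=(200, n))
Y[:100] += l1
Y -= Y.mean(axis=1, keepdims=True)           # center each row
labels, pvals = jackstraw_kmeans(Y, K=2, s=20, B=50, rng=rng)
```

The values of s and B in this toy run are far smaller than those used in the paper and are chosen only to keep the example fast.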
The choices of s and B control the speed of computation, while the total number of null statistics (s × B) determines the overall p-value resolution. For B iterations we need to cluster the jackstraw data B times, and for each iteration b = 1,…, B, we obtain s null statistics. Assuming s × B is held constant, a smaller s provides more accurate p-values while increasing the computational burden. Therefore, we want to preserve the original clusters as much as computational power permits. As we increase the number of synthetic null features s in Y*, the overall systematic variation captured by the K cluster centers may be substantially disrupted (seen as an increasing proportion of y* in Figure 1). We recommend s < 0.1 × m for genomic data, although the number of clusters (K) and the numbers of features assigned to them ($m_1, \ldots, m_K$) must be considered. A higher value of K for a given m would need a smaller s, so that the clusters with limited members are represented in the jackstraw data.
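The trade-off between s and B comes down to simple arithmetic, sketched below in Python. The helper name `jackstraw_budget` is hypothetical. It reports the total number of null statistics, the finest p-value step if all nulls were pooled, and whether s falls under the s < 0.1 × m guideline. Because p-values are stratified by cluster, each cluster only sees its share of the null statistics, so the per-cluster resolution is coarser than the pooled figure.

```python
def jackstraw_budget(m, s, B):
    """Summarize the null-statistic budget implied by s and B for m features."""
    total_nulls = s * B
    return {
        "total_nulls": total_nulls,
        "pooled_resolution": 1.0 / total_nulls,  # finest step if nulls were pooled
        "null_fraction": s / m,                  # share of rows replaced per iteration
        "within_guideline": s < 0.1 * m,
    }

# Settings reported for the yeast and scRNA-seq applications in this paper.
yeast = jackstraw_budget(m=5981, s=300, B=10000)
single_cell = jackstraw_budget(m=3381, s=100, B=10000)
```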
The overwhelming disruption would further inflate null F-statistics, since a larger number of synthetic null features would make up $c_k^*$. In extreme scenarios where all features have been resampled, the new cluster centers are completely dominated by independent synthetic null features. This operating characteristic allows us to guard against artificially inflated significance and to guide the input parameters for the proposed algorithm. In practice, we input C as the initial centers for K clusters when clustering the jackstraw data for efficient convergence. Furthermore, when computational cost is a concern, one may correlate C and C* to ensure comparability.
In contrast, conventional resampling methods can be applied to the cluster centers, resulting in a “naive” significance test. After all m features are resampled with replacement, their F-statistics with respect to ck are used to form an empirical distribution of null statistics. Observed F-statistics are compared to this empirical distribution to obtain naive p-values. This circular analysis inflates statistical significance, since the observed features are used twice: first to compute the cluster centers and then again to test against those same centers. Essentially, this represents how the bootstrap or permutation approaches would be applied to cluster membership assignments. We apply the conventional methods in simulation studies to demonstrate how the jackstraw approach overcomes this type of circular analysis.
Posterior Inclusion Probabilities
After the membership assignments for kth cluster are tested using the jackstraw, we investigated how to harness their mk p-values (or, the distribution of null statistics) to filter, de-noise, and visualize the clusters. When considering high-dimensional features typical in large-scale genomic studies, it is advantageous to consider a family of multiple hypotheses simultaneously [38]. Particularly, from mk jackstraw p-values, we propose to calculate posterior probabilities that features are included in a given cluster. A discussion of posterior inclusion probabilities (PIPs) that are used for shrinkage and improvement of latent variable estimates is available in Chapter 3 of [39].
Consider that the $m_k$ jackstraw p-values $p_k = (p_{1,k}, \ldots, p_{m_k,k})$ are obtained for the $m_k$ features that have been assigned to the kth cluster. We are interested in estimating the posterior probability that $b_{i,k} \neq 0$, since non-zero coefficients imply bona fide inclusion in the cluster:
$$\rho_i = \Pr(b_{i,k} \neq 0 \mid p_k) = 1 - \Pr(b_{i,k} = 0 \mid p_k).$$
The PIP can be readily obtained by estimating $\Pr(b_{i,k} = 0 \mid p_k)$ through an empirical Bayes approach [40,41]. In multiple hypothesis testing, $\Pr(b_{i,k} = 0 \mid p_k)$ is called a local false discovery rate (local FDR). There also exist related Bayesian methods that could be explored for specific applications and prior knowledge [42–44]. This results in m PIPs for K families of multiple hypothesis tests corresponding to K clusters, which can be used for:
- Retaining a subset of features yi with ρi > αρ, where αρ is a user-defined threshold,
- Visualizing features in reduced dimensions (e.g., PCA, t-SNE) where transparency ~ ρi,
- Improving the cluster centers by weighting the corresponding features with ρi.
Local FDRs and PIPs from K families of multiple hypothesis tests can be flexibly combined for downstream analyses to aid feature selection and dimension reduction. When applying the proposed methods on microarray and scRNA-seq data, we incorporate PIPs to hard-threshold and soft-threshold the observed features. Furthermore, this approach may improve a wide range of clustering methods by providing probabilistic measures and/or translating into fuzzy clustering algorithms.
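The conversion from p-values to PIPs can be illustrated with a deliberately crude Python sketch. It is not the empirical Bayes estimator of [40,41]. It assumes a simple threshold-based estimate of the null proportion (with a hypothetical tuning value `lam`) and a histogram estimate of the p-value density, from which a local FDR and PIP = 1 − local FDR follow. The function names and the simulated p-values are hypothetical.

```python
import numpy as np

def estimate_pi0(p, lam=0.5):
    """Null proportion estimated from the fraction of p-values above lam."""
    return min(1.0, float(np.mean(p > lam)) / (1.0 - lam))

def posterior_inclusion_prob(p, n_bins=20, lam=0.5):
    """PIP = 1 - local FDR, with the p-value density estimated by a histogram."""
    pi0 = estimate_pi0(p, lam)
    bin_idx = np.minimum((p * n_bins).astype(int), n_bins - 1)
    counts = np.bincount(bin_idx, minlength=n_bins)
    density = counts / (p.size / n_bins)         # histogram estimate of f(p)
    lfdr = np.clip(pi0 / density[bin_idx], 0.0, 1.0)
    return 1.0 - lfdr

# Toy usage for one cluster: 700 p-values skewed towards 0 and 300 uniform nulls.
rng = np.random.default_rng(2)
p = np.concatenate([rng.beta(0.2, 1.0, size=700), rng.uniform(size=300)])
pi0_hat = estimate_pi0(p)
pip = posterior_inclusion_prob(p)
keep = pip > 0.9                                 # hard threshold on the PIP
```

The same `pip` vector could instead be used as plotting transparency or as feature weights when recomputing a cluster center, matching the three uses listed above.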
Results
Unsupervised classification allows us to non-parametrically cluster large-scale data in the absence of accurate external labels for data features. Given that a set of features is assigned to K clusters, the proposed methods test their cluster membership assignments. To demonstrate their operating characteristics, we conducted comprehensive simulation studies, which enabled a critical assessment using the underlying truth (Oracle Groups). We then applied the proposed methods to a microarray study of Saccharomyces cerevisiae that examines the cell cycle and to scRNA-seq data from a mixture of Jurkat and 293T cell lines whose cell identities are of interest.
Simulation Studies
In the simulation studies, we follow the latent variable model described in Methods and Algorithms. Latent variables L are drawn from the Normal(μ = 0, σ2 = 1) distribution. Relationships between lk and features are given by dichotomous coefficients B, where bi, k indicates whether yi is a member of lk for k = 1,…, K and i = 1,…, m. The noise E is drawn i.i.d. from a zero-mean Normal distribution, whose variance governs the noise level. A total of m = 1000 features (rows) are simulated over n = 100 dimensions (columns). Forming Oracle Group A, 500 rows are true members of the signal cluster arising from l1, with bi, 1 = 1 for i = 1,…, 500. The other 500 rows are pure noise, forming Oracle Group B, which can be viewed as being centered around the n-dimensional origin. Therefore, the true proportion of null features is π0 = .50.
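The simulation design described above can be reproduced in outline with a short Python sketch. The function name `simulate` and its argument names are hypothetical, and the default noise variance of 10 is only one illustrative setting.

```python
import numpy as np

def simulate(m=1000, n=100, m_signal=500, noise_var=10.0, seed=0):
    """Simulate Y = b l1' + E with one latent variable and dichotomous coefficients."""
    rng = np.random.default_rng(seed)
    l1 = rng.normal(0.0, 1.0, size=n)                 # latent variable, Normal(0, 1)
    b = np.zeros(m)
    b[:m_signal] = 1.0                                # Oracle Group A: true members
    E = rng.normal(0.0, np.sqrt(noise_var), size=(m, n))
    Y = np.outer(b, l1) + E                           # Oracle Group B rows are pure noise
    return Y, b, l1

Y, b, l1 = simulate()
pi0_true = 1.0 - b.mean()                             # true proportion of null features
oracle_center = Y[b == 1].mean(axis=0)                # average of the true members
center_cor = np.corrcoef(oracle_center, l1)[0, 1]
```

Averaging the 500 true members largely cancels the feature-specific noise, which is why a well-recovered cluster center can correlate strongly with the latent variable even when individual features are noisy.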
We simulated three scenarios with noise variances of 5, 10, and 15, as an increasing noise level brings these two groups closer and makes the clustering task more difficult. PCA was applied on the dataset realized from each configuration to visualize the top 2 PCs (Figure S1). K-means clustering and the jackstraw tests were then applied while being blind to the Oracle Groups. Theoretically, the null p-values from the features that are not related to the latent variables (corresponding to Oracle Group B) should follow the Uniform(0,1) distribution, which can be evaluated by the Kolmogorov-Smirnov (KS) test. We repeated a given simulation configuration 100 times independently and investigated whether the 100 KS test p-values from the 100 independent simulations meet the joint null criterion [14].
We describe one simulation from the main scenario involving a moderate amount of noise (variance of 10). While 1000 features were split equally between Cluster 1 and 2, 30 and 470 null features were members of Cluster 1 and 2, respectively. Because Cluster 1 contained 470 features related to the latent variable l1, its center and l1 were highly correlated, with a Pearson correlation of 0.99. The jackstraw test was then applied on the simulated data with s = 100 synthetic null features over B = 5000 iterations, while being blind to simulation parameters. Figure 2(a) shows histograms of p-values stratified by Oracle Groups as parametrized by dichotomous coefficients in B. In Oracle Group B, the jackstraw p-values corresponding to 500 null features are uniformly distributed between 0 and 1. In contrast, the naive significance tests are highly anti-conservative, with p-values pushed towards 0. In Oracle Group A, the jackstraw p-values are greater than the naive p-values because the jackstraw approach learns the overfitting characteristics and corrects the anti-conservative bias (Figure 2(a)). Utilizing all m p-values, the proportion of null features is estimated to be $\hat{\pi}_0$ = 0.55 for the jackstraw and $\hat{\pi}_0$ = 0.29 for the naive methods.
We repeated this configuration to ensure accuracy and robustness across 100 independent simulations. In each simulation, we examined the joint behavior of 500 null p-values from Oracle Group B using a two-sided KS test. When the joint behavior of those KS test p-values follows the i.i.d. Uniform(0,1) distribution (where the double KS test p-value > αjnc), the subsequent multiple hypothesis testing procedures, including false discovery rates, hold true [14]. In other words, meeting the stringent standard of the joint null criterion demonstrates that the proposed methods overcome the “double-dipping” inherent in utilizing cluster centers and membership assignments and that the p-values are jointly and marginally accurate [14]. A set of 100 KS test p-values, estimated from both the jackstraw and naive methods, is visualized against the Uniform(0,1) distribution (Figure 2(b)). The jackstraw tests satisfy the joint null criterion, where the 100 KS test p-values are uniformly distributed (double KS test p-value = 0.79). In contrast, the naive methods are strongly anti-conservative, where the 100 KS test p-values are strongly skewed towards 0 (double KS test p-value < 2.2 × 10⁻¹⁶).
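The double KS procedure can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function name `ks_uniform_pvalue` is hypothetical, the p-value uses the asymptotic Kolmogorov series with a standard small-sample correction rather than an exact calculation, and the "null p-values" are simulated directly (uniform for a valid test, squared uniforms for an anti-conservative one) instead of coming from actual clustering runs.

```python
import math
import numpy as np

def ks_uniform_pvalue(x):
    """Two-sided one-sample KS test against Uniform(0,1), asymptotic p-value."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    i = np.arange(1, n + 1)
    d = max(np.max(i / n - x), np.max(x - (i - 1) / n))   # KS distance
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:                    # the survival function is 1 to machine precision
        return 1.0
    series = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2) for k in range(1, 101))
    return float(min(1.0, max(0.0, 2.0 * series)))

rng = np.random.default_rng(3)
# First level: one KS p-value per simulated study of 500 null p-values.
ks_valid = [ks_uniform_pvalue(rng.uniform(size=500)) for _ in range(100)]
ks_anti = [ks_uniform_pvalue(rng.uniform(size=500) ** 2) for _ in range(100)]
# Second level: KS test of the 100 KS p-values against Uniform(0,1).
double_valid = ks_uniform_pvalue(ks_valid)
double_anti = ks_uniform_pvalue(ks_anti)
```

In this toy setting the anti-conservative p-values pile up near 0 in every study, so the second-level test rejects uniformity decisively, which is the pattern reported for the naive methods.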
Results from two other simulation configurations, which are also independently repeated 100 times, are shown in Figure S2 and Figure S3. Simulated data with a relatively small noise variance of 5 can be almost perfectly clustered. Nonetheless, the naive methods exhibit substantial overfitting, where the double KS test p-value is < 2.2 × 10⁻¹⁶. The double KS test p-value for the jackstraw tests in this configuration is 0.81 (Figure S2). On the other hand, a greater noise variance of 15 represents a situation in which members of different clusters overlap substantially (Figure S1). The jackstraw tests indeed satisfy the joint null criterion, with a double KS test p-value of 0.67 (Figure S3). Additional simulation studies further confirm that, unlike the naive methods that overfit and produce an anti-conservative bias (downward deviations from the diagonal line), the proposed methods account for uncertainty in clustering and result in valid p-values that enable rigorous error control.
Genomic Applications
Microarray Data from Yeast Cell Cycle Experiments
The cell cycle in Saccharomyces cerevisiae and other organisms is traditionally known to progress through discrete stages, such as M, G1, S, and so on [45]. With the advent of the microarray, gene expression levels from synchronized S. cerevisiae samples were measured and analyzed in order to identify comprehensive sets of genes under cell cycle [8,46-48]. However, in these conventional studies, experimentally verified genes under cell cycle are used to identify related genes that follow similar patterns. In contrast, we re-analyzed the expression data of 5981 genes from [8] in an unsupervised manner.
Genome-wide mRNA levels of elutriation-synchronized yeast cells were measured at 30 min intervals for 390 min (approximately 1 cell cycle) [8]. We processed and normalized this gene expression data according to [13,49]. The number of clusters K = 6 was determined by the prior knowledge that there exist 6 stages of cell cycle [45]. While there are ongoing debates on how to characterize and categorize cell cycle progression, K = 6 seems to be a reasonable choice. After obtaining K = 6 clusters by applying K-means clustering, we conducted the proposed jackstraw tests with s = 300 and B = 10000 to identify canonical genes within those clusters.
Histograms of p-values are shown in Figure 3(a), where the proportions of null features π0 for the 6 clusters are estimated to be .143, .149, .116, .178, .170, .175, .087, respectively. Note that the numeric values identifying those clusters are arbitrary without a meaningful order, but consistent within this manuscript. From a set of p-values, we calculated posterior probabilities that those genes are truly members of their assigned clusters. For example, among 709 genes that are originally assigned to cluster 4, 45.1% (320) have posterior inclusion probabilities (PIPs) > 0.9. Repeating this analysis for all 6 clusters, a total of 3826 genes are found to be significant at the same PIP threshold (Figure 3(c)). In other words, these are the canonical genes that drive the clusters of cell cycle.
Single Cell RNA-Seq Data from Jurkat and 293T Cells
Whereas conventional microarray and RNA-seq experiments obtain “bulk” gene expression from a sample that contains multiple cells, scRNA-seq enables more precise and accurate quantification from single cell samples. Recent studies using high-throughput and efficient scRNA-seq often measure gene expression from unlabeled single cells, in order to elucidate detailed molecular landscapes and identify cell identities (e.g., blood sub-types, sub-classifications of a disease) [10–12]. Commonly, the cell identities are determined by applying clustering algorithms to their gene expression profiles.
We analyzed the scRNA-seq data from [12] that used a mixture of Jurkat and 293T cells (50:50). Note that while the mixture proportion is known, the identities of the individual cells that have been sequenced are unknown. Because the Jurkat (male and expressing CD3D) and 293T (female and expressing XIST) cell lines are highly distinct, we observed two clearly distinguishable groups separated along the 1st PC of their gene expression profiles [12]. However, massively parallel scRNA-seq unavoidably generates multiplets (doublets, triplets, etc.).
The rate of multiplets increases linearly with the recovered cell number, and through single nucleotide variant (SNV) detection, the authors inferred a 3.1% multiplet rate for this mixture experiment [12]. For ~ 10000 single cells, [12] reports > 8% multiplet rates. The ambiguous identities of single cells will become increasingly challenging as scRNA-seq becomes more affordable and widespread.
Following the original analysis pipeline, we applied K-means clustering on the top 10 PCs based on unique molecular identifier (UMI) counts. The jackstraw tests for those K = 2 clusters were conducted with s = 100 and B = 10000. We found that the jackstraw p-values capture deviation away from the two centers, along the 1st PC axis (Figure 4(a)). Using the q-value methodology [50], the proportion of null features (that are not members of the clusters) is estimated to be $\hat{\pi}_0$ = 0.05. Then, we computed the proposed PIPs from the p-values (Figure 4(b)). At PIP < 0.80 (equivalent to 20% local FDRs), 3.3% of 3381 single cells would be removed from their corresponding clusters, effectively and automatically removing the majority of suspected multiplets. Instead of hard-thresholding the single cell samples at an arbitrary threshold, we can also visualize posterior probabilities as levels of transparency in a conventional PCA projection, where the top 2 PCs are plotted (Figure 4(c)). Please note that because dimension reduction does not fully capture local and global structures in the original high dimensions, distances in reduced dimensions (PCA, t-SNE, and the like) should be considered with caution.
Discussion
The explosion of biological data has increased the importance of unsupervised learning. Without the external and accurate labels for the observed data, unsupervised learning aims to estimate latent structure, reduce dimensions, and classify data features. In particular, clustering of high-dimensional genomic data has led to better understanding of and informative hypotheses for biological functions [8,51], molecular subtypes [7,52], and cell identities [53]. However, data-dependent classification cannot be used in downstream analyses without incurring spurious statistical significance. Our proposed methods solve this challenge by learning the uncertainty inherent in deriving clusters from the data and conducting a statistical test using the jackstraw strategy.
There exists a wide range of clustering algorithms to automatically assign the m observed features into K clusters. The proposed methods test whether the observed features are correctly assigned to the corresponding clusters. Our key ingredient is to generate and re-cluster the jackstraw data, which include a very small number s of synthetic null features. Because s ≪ m, the majority of observed features are intact, resulting in cluster centers that are almost identical to the original cluster centers. Subsequently, the eventual assignments of the s synthetic null features into K clusters are used to derive the empirical null distribution. We have demonstrated favorable operating characteristics using simulated and real genomic data. The proposed PIP methods open new possibilities for selecting canonical cluster members, shrinking cluster centers, and improving clustering algorithms. Furthermore, the proposed methods may adaptively guide the choice of stable clusters.
Our proposed strategy enables rigorous application of unsupervised learning, such that the estimated latent structure can be re-used in downstream analyses. The jackstraw test for PCA and related methods [13] have been used in many specialized areas of genomic studies [11,12,51,54–56]. Complementing this successful approach, we have developed the jackstraw test for clustering. It may be useful to integrate both variants of the jackstraw tests, from selecting highly informative genes to deriving cell identities. Differential expression analyses based on cluster-based cell identities may become more robust by incorporating the jackstraw tests. Because the proposed methods are not limited to genomics, we anticipate their adaptation in other fields of data-intensive science.
Acknowledgments
This research was supported in part by the Polish National Science Centre (NCN) grant 2016/23/D/ST6/03613.
Footnotes
1 This convention is also followed in the software package where the rows of input data are clustered and tested.