Abstract
Single-cell computational pipelines involve two critical steps: organizing cells (clustering) and identifying the markers driving this organization (differential expression analysis). State-of-the-art pipelines perform differential analysis after clustering on the same dataset. We observe that because clustering forces separation, reusing the same dataset generates artificially low p-values and hence false discoveries. In this work, we introduce a valid post-clustering differential analysis framework which corrects for this problem. We provide software at https://github.com/jessemzhang/tn_test.
Modern advances in single-cell technologies can cheaply generate genomic profiles of millions of individual cells [1, 2]. Depending on the type of assay, these profiles can describe cell features such as RNA expression, transcript compatibility counts [3], epigenetic features [4], or nuclear RNA expression [5]. Because the cell types of individual cells often cannot be known prior to the computational step, a key step in single-cell computational pipelines [6, 7, 8, 9, 10] is clustering: organizing individual cells into biologically meaningful populations. Furthermore, computational pipelines use differential expression analysis to identify the key features that distinguish a population from other populations: for example, a gene based on its relative expression level.
Many single-cell RNA-seq discoveries are justified using very small p-values [9, 11]. The central observation underlying this paper is that these p-values are often spuriously small. Existing workflows perform clustering and differential expression on the same dataset, and clustering forces separation regardless of the underlying truth, rendering the p-values invalid. This is an instance of a broader phenomenon, colloquially known as “data snooping”, which causes false discoveries to be made across many scientific domains [12]. While several differential expression methods exist [9, 13, 14, 11, 15, 16], none of these tests correct for the data snooping problem, as they were not designed to account for the clustering process. As a motivating example, we consider the classic Student’s t-test introduced in 1908 [17], which was devised for controlled experiments where the hypothesis to be tested was defined before the experiments were carried out. For example, to test the efficacy of a drug, the researcher would randomly assign individuals to case and control groups, administer the placebo or the drug, and take a set of measurements. Because the populations were clearly defined a priori, a t-test would yield valid p-values. In other words, under the null hypothesis where no effect exists, the p-value should be uniformly distributed between 0 and 1. For single-cell analysis, however, the populations are often obtained, via clustering, after the measurements are taken, and therefore we can expect the t-test to return significant p-values even if the null hypothesis were true. The clustering introduces a selection bias [18, 19] that would result in several false discoveries if uncorrected.
In this work, we introduce a method for correcting the selection bias induced by clustering. To gain intuition for the method, consider a single-gene example where a sample is assigned to a cluster based on the expression level of the gene relative to some threshold. Fig. 1a shows how this expression level is deemed significantly different between two clusters even though all samples came from the same normal distribution. We attempt to close the gap between the blue and green curves in the rightmost plot by introducing the truncated normal (TN) test. The TN test (Fig. 1b) is an approximate test based on the truncated normal distribution that corrects for a significant portion of the selection bias. As we go from 1 gene to multiple, the decision boundary generalizes from a threshold to a high-dimensional hyperplane. We condition on the clustering event using the hyperplane that separates the clusters, and Supplementary Table 1 shows that this linear separability assumption is valid for a diverse set of published single-cell datasets. By incorporating the hyperplane into our null model, we can obtain a uniformly distributed p-value even in the presence of clustering. To our knowledge, the TN test is the first test to correct for clustering bias while addressing the differential expression question: is this feature significantly different between the two clusters? We then proceed to provide a data-splitting based framework (Fig. 1c) that allows us to generate valid differential expression p-values for clusters obtained from any clustering algorithm. Using both synthetic and real datasets, we argue that for a given set of clusters, not all reported markers can be trusted. Importantly, this point implies that
1) large correction factors for multiple markers can indicate overclustering, and 2) plotting expression heatmaps where rows and columns are arranged by cluster identity can convey misleading information (e.g. Fig. S6 in [20] and Fig. 6b in [1]).
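The data snooping effect described above is easy to reproduce. In the toy sketch below (our illustration, not part of the published pipeline), every sample comes from one and the same normal distribution, yet a t-test comparing sign-based “clusters” is wildly significant:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# All 1000 "cells" come from the SAME normal distribution: no real clusters.
x = rng.normal(loc=0.0, scale=1.0, size=1000)

# A toy clustering rule: assign cells by the sign of their expression.
cluster1, cluster2 = x[x < 0], x[x >= 0]

# The t-test, unaware of how the groups were formed, is wildly significant
# even though the null hypothesis is true by construction.
_, p = stats.ttest_ind(cluster1, cluster2)
print(p)  # astronomically small despite no true difference
```

Clustering selects the very separation the test then measures, which is exactly the selection bias the TN test is built to remove.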
We consider the peripheral blood mononuclear cell (PBMC) dataset of 2700 cells generated using recent techniques developed by 10x Genomics [2]; this dataset was also used in a tutorial for the Seurat single-cell package [6]. Fig. 2a shows the Seurat pipeline output after preprocessing the dataset and running a graph-based clustering algorithm [21, 22, 23], yielding 9 clusters. For each of 7 approaches offered by Seurat (see Supplement for more details), we perform differential expression analysis on clusters 0 and 1, and we compare the obtained p-values to TN test p-values. Fig. 2a shows that while the TN test agrees with the differential expression tests on several genes (e.g. S100A4), it disagrees on some others. The two genes with the most heavily corrected p-values were B2M and HLA-A. While several of the Seurat-provided tests would detect a significant change (e.g. the popular Wilcoxon test reported p = 8.5 × 10−30 for B2M and p = 3.8 × 10−17 for HLA-A), the TN test accounts for the fact that this difference in expression may be driven by the clustering approach (p = 3.7 × 10−13 for B2M and p = 7.2 × 10−8 for HLA-A). Because the amount of bias correction is different for each gene, the TN test orders markers differently than clustering-agnostic methods. Comparisons are also performed for clusters 1 versus 3 and 2 versus 5, and the results are reported in Supplementary Fig. 2. This indicates that the artifacts of post-selection inference are consequential even in real datasets.
We further explore how the TN test can be used to both validate and contest reported subtypes. For a dataset of 3005 mouse brain cells [20], the authors reported 16 subtypes of interneurons using 26 gene markers. Fig. 2b and Supplementary Fig. 3 show that Int11, the only subtype that was experimentally validated using immunohistochemistry, received relatively small amounts of correction. Int1, Int12, and Int16, however, may require further inspection.
We also demonstrate how the TN test can be used to gauge overclustering. We run the Seurat clustering pipeline on a dataset of 704 mouse embryonic stem cells (mESCs) [24] using two different clustering parameters, resulting in the two clustering results shown in Fig. 2c. For each pair of clusters, we look at the top 10 most significant genes chosen by the t-test. We correct these p-values using the TN test and compute the geometric mean of the ratio of TN test p-values to t-test p-values. We see that for valid clusters (Clustering 1), the p-value obtained using the TN test is often even smaller, meaning no correction was needed. When clusters are not valid (Clustering 2), however, we observe a significant amount of correction.
State-of-the-art single-cell clustering pipelines such as Seurat can generate different clustering results on the same dataset (Supplementary Fig. 4). Importantly, different clustering results imply different null hypotheses when we reach the differential expression analysis step, which further undermines the validity of the “discovered” differentiating markers. Although data splitting reduces the number of samples available for clustering, we see that sacrificing a portion of the data can correct for biases introduced by clustering.
The post-selection inference problem arose only recently in the age of big data due to a new paradigm of choosing a model after seeing the data. The problem can be described as a two-step process: 1) selection of the model to fit the data based on the data, and 2) fitting the selected model. The quality of the fitting is assessed based on the p-values associated with parameter estimates, but if the null model does not account for the selection event, then the p-values are spurious. This problem was first analyzed in 2013 by statisticians in settings such as selection for linear models under squared loss [18, 19]. Practitioners often select a subset of “relevant” features before fitting the linear model. In other words, the practitioner chooses the best model out of 2^d possible choices (d being the number of features), and the quality of fit is hence biased. One needs to account for the selection in order to correct for this bias [18]. Similarly, a single-cell RNA-seq dataset of n cells can be divided into 2 clusters in 2^n ways, biasing the features selected for distinguishing between clusters. In this manuscript, we propose a way to account for this bias.
This work introduced and validated the TN test framework in the single-cell RNA-Seq application, but the framework is equally applicable to other domains where feature sets are large and clustering is done before feature selection. Because science has entered a big-data era where obtaining large datasets is becoming increasingly cheap, researchers across domains have fallen into the mindset of forming hypotheses after seeing the data [12]. We believe that the TN test is a step in the right direction: correcting data snooping to reduce false discoveries and improve reproducibility.
Online Methods
Both the software package and the code used to generate the results presented in this paper are available online at https://github.com/jessemzhang/tn_test.
Simulation details
We validate the method on synthetic datasets where the ground truth is fixed and known. For the experiments discussed in Supplementary Fig. 1, we sample data from normal distributions with identity covariance prior to clustering, resulting in data sampled from truncated normal distributions post-clustering. To estimate the separating hyperplane a, we fit an SVM to 10% of the dataset (50% for Fig. 1c), and we work with the remaining portion of the dataset after relabeling it based on our estimate of a. Supplementary Fig. 1a shows results for the 2-gene case where no differential expression should be observed (i.e. the untruncated means are identical). Note that for this example, gene 1 needs a larger correction factor than gene 2 because the separating hyperplane is less aligned with the gene 1 axis. We see that when both the variance and separating hyperplane a are known, the TN test completely corrects for the selection event. As we introduce more uncertainty (i.e. if we need to estimate variance or a or both), the correction factor shrinks; however, the gap is still significantly better than for the t-test case. Supplementary Fig. 1b repeats the experiment for the case where gene 1 is differentially expressed. The TN test again corrects for the selection bias in gene 2, but we still obtain significant p-values for gene 1 though not nearly as extreme as for the t-test case. Supplementary Fig. 1c shows that as we increase d, the number of genes, the minimum TN test p-value across all d genes follows the family-wise error rate (FWER) curve. Since FWER represents the probability of making at least 1 false discovery and naturally increases with d, this highlights the validity of the TN test. In comparison, the t-test returns extreme p-values especially for lower values of d. As d increases, however, the selection bias incurred by our simple clustering approach disappears. 
While the TN test provides less gain in higher (≥ 200) dimensions, we note that for real datasets, cluster identities are often driven by an effectively small number of genes, which is why several single-cell pipelines perform dimensionality reduction before clustering. Supplementary Fig. 1b and 1d show that when certain genes are differentially expressed, the TN test is still able to find them.
Single-cell dataset computational details
For all experiments discussed in Fig. 2, we randomly split the set of samples in half into datasets 1 and 2. For Fig. 2a, we recluster dataset 1 with Seurat using clustering parameters that would result in 2 clusters. We use an SVM to obtain a hyperplane that perfectly separates the two clusters, and we use this hyperplane to assign labels to samples in dataset 2. When comparing the TN test results to those obtained using other approaches, we run the entire Seurat pipeline (including differential expression analysis) on dataset 1. For the mouse brain cell and mESC datasets analyzed in Fig. 2b and 2c, we assume that the generated labels are ground truth, and therefore we do not perform the reclustering part of the analysis framework shown in Fig. 1. For Fig. 2b, we only report correction factors for cases where the SVM fit the data well, meaning that the new labels generated for dataset 2 have at least an 80% match with the original labels. We note that this does not contradict the linear separability assumption discussed in the main text. The sizes of the interneuron subclusters range from 10 to 26, and therefore the SVM was occasionally fit on as few as 5 samples, resulting in an inability to generalize. Additionally, we only report correction factors greater than 0.
Clustering model
To motivate our approach, we consider the simplest model of clustering: samples are drawn from one of two clusters, and the clusters can be separated using a linear separator. As we show in Supplementary Table 1, the linear separability assumption is often true for high-dimensional datasets such as single-cell datasets. For the rest of this section, we assume that the hyperplane a is given and independent of the data we are using for differential expression analysis. For example, we can assume that in a dataset of n independent and identically distributed samples and d genes, we had set aside n1 samples to generate the two clusters and identify a, thus allowing us to classify future samples without having to rerun our clustering algorithm. We run differential expression analysis using the remaining n2 = n − n1 samples while conditioning on the selection event. More specifically, our test accounts for the fact that a particular a was chosen to govern clustering. We will later demonstrate empirically that the resulting test suffers from significantly less selection bias.
For pedagogical simplicity, we start by assuming that our samples are 1-dimensional (d = 1) and that our clustering algorithm divides the samples into two clusters based on the sign of the (mean-centered) observed expressions. Let Y represent the negative samples and Z represent the positive samples. We assume that our samples come from normal distributions with known variance 1 prior to clustering, and we condition on the clustering event by introducing truncations into our model. Therefore Y and Z have truncated normal distributions due to clustering:

fY(y) = φ(y − µL) 𝕀(y < 0) / Φ(−µL),    fZ(z) = φ(z − µR) 𝕀(z > 0) / Φ(µR).

Here, the 𝕀 terms are indicator functions denoting how truncation is performed, and the Φ terms are normalization factors that ensure fY and fZ integrate to 1. φ and Φ represent the pdf and CDF of a standard normal random variable, respectively. µL and µR denote the means of the untruncated versions of the distributions. We want to test whether the gene is differentially expressed between the two populations Y and Z, i.e. whether µL = µR.
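These two truncated densities (a unit-variance normal conditioned to be negative or positive) can be written down directly. A minimal scipy sketch, where the values chosen for µL and µR are arbitrary and purely for illustration:

```python
import numpy as np
from scipy import stats, integrate

mu_L, mu_R = -0.5, 1.2  # illustrative untruncated means (arbitrary choices)

# Density of Y: a N(mu_L, 1) variable conditioned on being negative.
def f_Y(y):
    return stats.norm.pdf(y, loc=mu_L) / stats.norm.cdf(-mu_L) * (y < 0)

# Density of Z: a N(mu_R, 1) variable conditioned on being positive.
# P(N(mu_R, 1) > 0) = 1 - Phi(-mu_R), computed stably via the survival function.
def f_Z(z):
    return stats.norm.pdf(z, loc=mu_R) / stats.norm.sf(-mu_R) * (z > 0)

# The normalization terms ensure each density integrates to 1.
print(integrate.quad(f_Y, -np.inf, 0)[0])  # ~1.0
print(integrate.quad(f_Z, 0, np.inf)[0])   # ~1.0
```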
1D TN test when variance = 1 and clustering is performed based on sign of expression
Input: Two groups of samples Y, Z
Output: p-value
Using maximum likelihood, estimate µL, µR, the mean parameters of the truncated Gaussian distributions
To obtain the null distribution, set µL = µR = (µ̂L + µ̂R)/2, then obtain estimates of µY, µZ, σ²Y, σ²Z, the means and the variances of the truncated distributions under the null
Perform an approximate test with the statistic

T = (mZ̄ − nȲ − (mµZ − nµY)) / √(mσ²Z + nσ²Y),

where m and n represent the number of samples of Z and Y, respectively. This statistic is approximately 𝒩(0, 1) distributed.
Derivation of the test statistic
The joint distribution of our n samples of Y with our m samples of Z can be expressed in exponential family form as

f(y, z) = h(y, z) exp( nµLȲ + mµRZ̄ − ψ(µL, µR) ),

where Ȳ and Z̄ represent the sample means of Y and Z, respectively. ψ is the cumulant generating function, and h is the carrying density. Please see the Supplement for more details. To test for differential expression, we want to test whether µL = µR, which is equivalent to testing whether µR − µL = 0. With a slight reparametrization, we let θ = (µR − µL)/2 and µ = (µR + µL)/2, resulting in the expression:

f(y, z) = h(y, z) exp( µ(nȲ + mZ̄) + θ(mZ̄ − nȲ) − ψ(µ, θ) ).

We can design tests for θ = θ0 using its sufficient statistic, mZ̄ − nȲ [25]. From the Central Limit Theorem (CLT), we see that the test statistic

T = (mZ̄ − nȲ − (mµZ − nµY)) / √(mσ²Z + nσ²Y)

is approximately 𝒩(0, 1) distributed. Intuitively, this test statistic compares mZ̄ − nȲ, the gap between the observed means, to mµZ − nµY, the gap between the expected means. For differential expression, we set θ0 = 0. Because under the null µL = µR are unknown, we estimate them from the data by first estimating µL and µR via maximum likelihood. Although the estimators for µL and µR have no closed-form solutions due to the Φ terms, the joint distribution can be represented in exponential family form. Therefore the likelihood function is concave with respect to µL and µR, and we can obtain estimates via gradient ascent. We then set µL = µR = (µ̂L + µ̂R)/2. This procedure is summarized in Algorithm 1. We note that approximation errors accumulate from the CLT approximation and from the maximum likelihood estimation process, and therefore the true null distribution of the test statistic has somewhat heavier tails. Despite this, we show later that this procedure corrects for a large amount of the selection bias.
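Algorithm 1 can be sketched in a few dozen lines. The following is an illustrative reimplementation (not the released tn_test package), using `scipy.optimize.minimize_scalar` for the maximum likelihood step and `scipy.stats.truncnorm` for the truncated moments under the null:

```python
import numpy as np
from scipy import stats, optimize

def tn_test_1d(y, z):
    """1-D TN test sketch: y are the negative samples, z the positive
    samples, variance assumed 1, truncation at 0."""
    n, m = len(y), len(z)

    # Step 1: maximum-likelihood estimates of the untruncated means.
    def nll(mu, x, neg_side):
        # negative log-likelihood of a unit-variance normal truncated
        # to one side of 0 (the log-CDF term is the normalizer)
        log_norm = stats.norm.logcdf(-mu) if neg_side else stats.norm.logsf(-mu)
        return -(np.sum(stats.norm.logpdf(x - mu)) - len(x) * log_norm)

    mu_L = optimize.minimize_scalar(nll, args=(y, True), bounds=(-10, 10),
                                    method='bounded').x
    mu_R = optimize.minimize_scalar(nll, args=(z, False), bounds=(-10, 10),
                                    method='bounded').x

    # Step 2: under the null, set both untruncated means to their average,
    # then compute the means/variances of the truncated distributions.
    mu0 = (mu_L + mu_R) / 2
    mu_Y, var_Y = stats.truncnorm.stats(-np.inf, -mu0, loc=mu0, scale=1)
    mu_Z, var_Z = stats.truncnorm.stats(-mu0, np.inf, loc=mu0, scale=1)

    # Step 3: approximately N(0, 1) statistic built from the sufficient
    # statistic m*Zbar - n*Ybar, centered and scaled by its null moments.
    T = (m * np.mean(z) - n * np.mean(y) - (m * mu_Z - n * mu_Y)) \
        / np.sqrt(m * var_Z + n * var_Y)
    return 2 * stats.norm.sf(abs(T))

rng = np.random.default_rng(1)
x = rng.normal(size=1000)      # one population: no true difference
y, z = x[x < 0], x[x >= 0]     # "clusters" induced by the sign of x
print(tn_test_1d(y, z))        # far larger than the t-test p-value on this split
```

On data with no true difference, the naive t-test on the sign-split groups is astronomically significant, while this corrected p-value is not.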
TN test for d dimensions and unknown variance
In this section, we generalize our 1-dimensional result to d dimensions and non-unit variance. Our samples now come from the multivariate truncated normal distributions

fY(y) ∝ φ(y; µL, Σ) 𝕀(aᵀy < 0),    fZ(z) ∝ φ(z; µR, Σ) 𝕀(aᵀz > 0),

where µL, µR, Σ denote the means and covariance matrix of the untruncated versions of the distributions and a is the separating hyperplane. We assume that all samples are drawn independently and that Σ is diagonal: Σij = σ²i if i = j and Σij = 0 otherwise. The joint distribution of our n samples of Y with our m samples of Z can be expressed in exponential family form as
TN test
Input: Two groups of samples Y, Z, a separating hyperplane a
Output: p-value
Using maximum likelihood, estimate the mean and variance parameters of the truncated Gaussian distributions on either side of the hyperplane: µL, µR, Σ
For gene g, obtain the marginal distributions under the null (i.e. setting µL,g = µR,g = (µ̂L,g + µ̂R,g)/2)
Using numerical integration, obtain estimates of µY,g, µZ,g, σ²Y,g, σ²Z,g, the means and the variances of the null marginal distributions of Yg and Zg
Perform an approximate test with the statistic

Tg = (mZ̄g − nȲg − (mµZ,g − nµY,g)) / √(mσ²Z,g + nσ²Y,g),

where m and n represent the number of samples of Z and Y, respectively.
f(y1, …, yn, z1, …, zm) = h(·) exp( ⟨η1, Σi yi⟩ + ⟨η2, Σj zj⟩ + ⟨η3, Σi yi ∘ yi + Σj zj ∘ zj⟩ − ψ(η1, η2, η3) ),

where ψ is the cumulant generating function, h is some carrying density, and ∘ denotes the elementwise product, so the quadratic term is a weighted version of the squared Frobenius norms ‖Y‖F, ‖Z‖F of the stacked sample matrices. The natural parameters η1, η2, η3 are equal to Σ⁻¹µL, Σ⁻¹µR, and −(1/2)diag(Σ⁻¹), respectively. To test differential expression of gene g, we can test if µL,g = µR,g, which is equivalent to testing µR,g − µL,g = 0, or θg = 0. In similar spirit to the 1-dimensional case, we perform a slight reparameterization, letting θg = (µR,g − µL,g)/2 and µg = (µR,g + µL,g)/2. We again design tests for θg using its sufficient statistic, mZ̄g − nȲg. During the testing procedure, we want to evaluate if θg = 0 (i.e. if gene g has significantly different mean expression between the two populations). With θg = 0 as our null hypothesis, we compute the corresponding parameters under the null, allowing us to evaluate the probability of seeing a TN statistic at least as extreme as the one observed for the actual data.
Like in the 1-dimensional case, we use maximum likelihood to estimate η1, η2, and η3, leveraging the fact that the likelihood function is concave because the joint distribution is an exponential family. After estimating the natural parameters, we can easily recover Σ, µL, and µR. To obtain estimates under the null, we first set µL,g = µR,g = (µ̂L,g + µ̂R,g)/2. We then use numerical integration to obtain the first and second moments of gene g’s marginal distributions, yielding µY,g, µZ,g, σ²Y,g, and σ²Z,g. The TN test procedure is summarized in Algorithm 2 and Fig. 1. More details regarding the above derivations are given in the Supplement. We note that this test is approximate because the test statistic becomes normally distributed only when we have a large number of samples (m and n both large). In practice, real datasets involve a finite number of samples, and the parameters need to be estimated from the data, so the tails of our test statistic’s null distribution should be heavier in order to capture the added uncertainty. As shown in this work, we can obtain significant selection bias correction (for both real and synthetic datasets) despite this approximation error.
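For intuition, the numerical integration step can be mimicked by Monte Carlo: sample from the untruncated null distribution, keep the points on one side of the hyperplane, and read off the marginal moments of gene g. The sketch below assumes identity covariance and an axis-aligned hyperplane purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
mu0 = np.zeros(d)               # null mean shared by both clusters (illustrative)
a = np.array([1.0, 0.0, 0.0])   # separating hyperplane normal (illustrative)

# Sample from the untruncated N(mu0, I) and keep the Z side (a^T x > 0).
x = rng.normal(size=(500_000, d)) + mu0
z = x[x @ a > 0]

# Marginal mean/variance of gene g under the truncated null distribution.
g = 0
mu_Z_g, var_Z_g = z[:, g].mean(), z[:, g].var()
print(mu_Z_g, var_Z_g)  # for this a: about sqrt(2/pi) ~ 0.80 and 1 - 2/pi ~ 0.36
```

Here gene 0 lies along the truncation direction, so its null marginal is a half-normal; genes 1 and 2 are untouched by the truncation, illustrating why the correction is gene-dependent.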
TN test for post-clustering p-value correction
Clustering and TN test framework
Input: Samples X
Output: p-value
Split X into two partitions X1, X2
Run your favorite clustering algorithm on X1 to generate labels, choosing two clusters for downstream differential expression analysis
Use X1 and the labels to determine a, the separating hyperplane (e.g. using an SVM)
Divide X2 into Y, Z using the obtained hyperplane
Run TN test using Y, Z, a
We describe a full framework (Fig. 1) for clustering the dataset X and obtaining corrected p-values via the TN test. Using a data-splitting approach, we run some clustering algorithm on one portion of the data, X1, to generate 2 clusters. For differential expression analysis, we estimate the separating hyperplane a using a linear binary classifier such as a support vector machine (SVM). This hyperplane is used to assign labels to the remaining samples in X2, yielding Y and Z. Finally, we run a TN test using Y, Z, and a. This approach is summarized in Algorithm 3. Note that in the case of k > 2 clusters, we can assign all points in X2 using our collection of hyperplanes.
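Algorithm 3 can be sketched end to end. In the illustration below, a simple two-means step stands in for “your favorite clustering algorithm” and a mean-difference linear classifier stands in for the SVM; both substitutions are ours, chosen only to keep the sketch self-contained:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two planted populations in d = 5 dimensions.
d = 5
X = np.vstack([rng.normal(0.0, 1.0, size=(200, d)),
               rng.normal(2.0, 1.0, size=(200, d))])
rng.shuffle(X)

# Step 1: split the samples into two halves X1, X2.
X1, X2 = X[:200], X[200:]

# Step 2: cluster X1 (a few iterations of 2-means, initialized at the
# samples with the smallest and largest coordinate sums).
s = X1.sum(axis=1)
centers = np.array([X1[s.argmin()], X1[s.argmax()]])
for _ in range(10):
    labels = np.argmin(((X1[:, None] - centers) ** 2).sum(-1), axis=1)
    centers = np.array([X1[labels == k].mean(axis=0) for k in (0, 1)])

# Step 3: a separating hyperplane a (mean-difference direction, with the
# offset folded in by centering at the midpoint between cluster centers).
a = centers[1] - centers[0]
midpoint = centers.mean(axis=0)

# Step 4: assign the held-out samples in X2 using only the hyperplane.
side = (X2 - midpoint) @ a > 0
Y, Z = X2[~side], X2[side]

# Step 5: Y, Z, and a would now be passed to the TN test.
print(len(Y), len(Z))
```

Because the hyperplane is estimated on X1 only, the labels on X2 are independent of the clustering noise in X2 itself, which is what makes the downstream test valid.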
Acknowledgements
We thank Jonathan Taylor, Martin Zhang, and Vasilis Ntranos of Stanford University and Aaron Lun of the Cancer Research UK Cambridge Institute for helpful discussions about selective inference and applications of the method. GMK and JMZ are supported by the Center for Science of Information, an NSF Science and Technology Center, under grant agreement CCF-0939370. JMZ and DNT are supported in part by the National Human Genome Research Institute of the National Institutes of Health under award number R01HG008164.