A statistical method for identifying consistently important features across samples

In many applications, features with consistently high measurements across many samples are particularly meaningful and useful for quality control or biological interpretation. Identification of these features among many others can be challenging especially when the samples cannot be expected to have the same distribution or range of values. We present a general method called conserved feature discovery (CFD) for identifying features with consistently strong signals across multiple conditions or samples. Given real-valued data, CFD requires no parameters, makes no assumptions on the underlying sample distributions, and is robust to differences across these distributions. We show that with high probability CFD identifies all true positives and no false positives under certain assumptions on the median and variance distributions of the feature measurements. Using simulated data, we show that CFD is tolerant to a small percentage of poor quality samples and robust to false positives. Applying CFD to RNA sequencing data from the Human Body Map project and GTEx, we identify lists of housekeeping genes as highly expressed genes across tissue types and compare to previous results in this domain. CFD is consistent between the two data sets, and identifies lists of genes enriched for basic cellular processes as expected. The framework can be easily adapted for many data types and desired feature properties.


Introduction
Many biological applications involve measuring features across a range of samples, bringing up the natural question of which features have consistently high values across samples. Despite many methods for identifying features with significant differences between sets of samples [19,18,20], there is no general statistical method for identification of features that are consistent across sample sets. Similarity across conditions can signal the importance of a feature, such as an epigenetic mark required for proper gene regulation or an oncogene that is highly expressed across many cancer types. Genes that are highly expressed across many healthy tissue types are likely to be essential for cellular functioning, and can be used as controls for data normalization. This work presents a statistical method for identifying features that, across a majority of samples, have high measurements relative to their respective sample distributions.
Previous related methods focus on specific use cases, such as reproducibility measurements. Irreproducible discovery rate (IDR) is a statistical method to identify features from high-throughput sequencing experiments that are consistent across replicates [13]. In principle this is a very similar goal to the one presented here, but the assumptions inherent to IDR are specific to the case of replicates. IDR looks for the top n signals with highest values, where the challenge is determining the n at which the correspondence between replicate signals drops off. Additionally, IDR expects the input samples to have similar distributions among the highest values, which is an appropriate assumption when looking at replicates but not when studying non-replicate samples.
Identification of features in single-cell data that are stable across cells has recently become an important problem as single-cell data becomes more available and prevalent in genomic and epigenomic analyses. There have been a few methods developed specifically for the discovery of so-called stably expressed genes (scSEGs) in single-cell RNA sequencing (scRNAseq) [10,14]. Recently, a method called scMerge [14] used a Gamma-Gaussian mixture model to compute certain characteristics related to stability that are then used to rank genes by an "SEG index," which is the average rank across these stability properties. The assumptions made by this model, including the Gamma and Gaussian priors, are specific to the analysis of scRNAseq and would likely not generalize to other domains. scMerge additionally requires an arbitrary rank percentile cutoff, and assumes similar ranges of values across conditions. Another recent method called CORGI, which ranks genes based on their ability to capture common trajectories between scRNAseq data sets, is specifically for use in trajectory inference methods [24]. Though the goal of CORGI is to integrate data sets and therefore should be robust to distribution and value differences unlike scMerge, it optimizes for a different objective (capturing common trajectories) than the one stated here.
In bulk gene expression analysis, genes that are consistently expressed across all cell and tissue types have been known as "housekeeping genes." This term is generally used to describe genes that are required for basic cellular functioning, and many methods on many different data types have been developed for their identification [4,8,3,21,12,26,27,6]. Housekeeping genes tend to be highly active, and their expression is essential for survival [25]. They can additionally be used as controls or in normalization methods for gene expression analyses. While they have been studied for a long time, there is little consensus on which genes are most confidently considered housekeeping genes, with differing methods reporting lists with little overlap [7,21,2]. One of the reasons for differing results is that there is no consistent definition of a housekeeping gene; some studies look for genes which are simply expressed in all samples [4], others look for genes with consistent expression levels across samples [8], and still others look for genes with high expression across samples [3]. The discrepancies between results from different studies highlights the importance of rigorous methods in this space.
We introduce a parameter-free statistical method, called conserved feature discovery (CFD), for identifying features with consistently high values across samples, robust to differences in distributions and ranges of values between the samples studied. CFD works with any data type that can be viewed as a list of features with numerical values for each sample. This includes RNA sequencing (RNA-seq) data, where features are genes or transcripts and the values represent their abundance, or ChIP-seq data, where features could be genomic bins and the values are the peak heights at these locations. CFD could also be used with single-cell data, or with mass spectrometry measurements. Regardless of the underlying data type and feature set, CFD identifies the features that are statistically significantly conserved with relatively high values across input samples.
We provide theoretical guarantees and demonstrate the utility of CFD with simulation data to show tolerance to uninformative samples and false positives. CFD is applied to biological data, identifying housekeeping genes from two human tissue RNA-seq datasets, resulting in robust gene lists enriched for annotations related to basic cellular processes.

Methods
Given a set of measurements on features in multiple samples, CFD identifies features that have consistently high values across samples by computing a conservation p-value that considers both consistency across samples and magnitude within a sample. The method first converts input values to rankings representing the fraction of the sample with higher values. The median and variance of these rankings for each feature are then combined with a χ 2 test, and we use multiple hypothesis correction to return features that are statistically significant in conservation across samples and demonstrate relatively high values within each sample ( Figure 1). Details of each step are given below.

Computing relative magnitude and variance of features
The input to CFD is a matrix A ∈ R m×n of measurements on m features, over n samples or conditions, or n vectors of m feature values, which we will combine into an m × n matrix. The input data is expected to be nonnegative, so any features with zero values across all samples are removed to avoid testing unnecessary hypotheses where a feature takes the minimum of the domain under all conditions. In order to convert the given measurement values to p-values without making assumptions on the underlying distributions of measurements, we define a function ψ : R m×1 → [0, 1] m×1 that computes, for each value in the input vector, the fraction of other numbers within the input vector of greater value, returning a vector with values between 0 and 1. We apply ψ(·) to each column of A, storing the results in a matrix B: with 1[·] as the standard indicator function. This formulation makes CFD invariant to translations and positive multiplicative factors of the input samples. If the input data has negative values, it can therefore be shifted by any amount to ensure nonnegativity. We then compute the median and variance of each row of B. This gives two vectors u and v of length m, where u k = median(B k * ) and v k = Var(B k * ) for each row k of B. Intuitively, the vector u represents the median rank of each feature within its sample distribution, and v quantifies how much that ranking changes across samples.

Combination into one test statistic
We look for the features with high median ranking and low variance of rankings with respect to other features, computing u ρ = ψ(u) and v ρ = 1 − ψ(v), which measure how many features have higher median ranks (u ρ ) and how many have lower variance of ranks (v ρ ). The two vectors of p-values, u ρ and v ρ , are combined with Fisher's method, returning a vector x of χ 2 values with four degrees of freedom: These values are then converted back to p-values for multiple hypothesis correction, via the standard cumulative density function for χ 2 values.

Multiple hypothesis correction
At this stage, we have a vector of p-values that reflects the statistical significance of the medians and variances of all feature rankings within their sample distributions. However, many biological applications of this method will have a large number of features, requiring multiple hypothesis correction. We use the Benjamini-Hochberg procedure [1] to control the false discovery rate at a level of 0.05, and report only the features and their p-values if the null hypothesis can be rejected. Briefly, the Benjamini-Hochberg procedure returns an index on a list of sorted p-values, scaling the threshold based on the position in the list and total number of hypotheses. The null hypothesis can be rejected for all hypotheses up to this index.

Results
We first derive some theoretical guarantees on CFD's sensitivity and specificity under certain assumptions, then demonstrate two desirable properties using simulated data. CFD is applied to biological data using two RNA-seq data sets, identifying genes that are consistently highly expressed across broad ranges of human tissue samples.

Theoretical guarantees
Given features with median and variance ranks drawn from normal distributions, we show that with a probability dependent on the standard deviation of these distributions, the p-values of all true positive features will be less than 0.05 as long as the true positives make up no more than 10% of all features, and that all background features will have p-values greater than 0.05. For the following, let there be t true positive features in a set Y , and s background features in a set X: Y = {y i | i = 1, 2, . . . , t}, X = {x i | i = 1, 2, . . . , s}. Suppose the background features have medians and variances drawn from normal distributions: median(X) ∼ N (µ m , σ 2 ), and Var(X) ∼ N (µ v , σ 2 ). The true positives, which should have higher medians and lower variances, also have medians and variances drawn from normal distributions, but with higher or lower means, respectively, and smaller standard deviations: median(Y ) ∼ N (µ m + 2σ, σ 2 /3), and Var(Y ) ∼ N (µ v − 2σ, σ 2 /3). In particular, the true positives have medians drawn from a normal distribution centered two standard deviations higher than the background values, and variances are drawn from a distribution centered at two standard deviations lower than the background.
We first give a lemma that will be used later, defining the final p-value as a function of the median and variance ranks.
Lemma 3.1. For a feature ζ with p-value of its median rankings (there are m features with a higher median) and p-value of its variance rankings δ (there are δm features with lower variance), the p-value p ζ is given by p ζ = δ(1 − ln( δ)).
Proof. We first note that using Fisher's method to combine p-values, the χ 2 value for this feature will be x = −2(ln( ) + ln(δ)) = −2 ln( δ). In order to convert this value back to a single combined p-value, we must solve where k is the number of degrees of freedom (twice the number of p-values combined) giving k = 4 here, Γ(·) is the standard Gamma function, and γ(·, ·) is the lower incomplete gamma function: where Φ is the cumulative distribution function for the standard . For a particular true positive feature y, P (m y i ≥ m y ) = 1 − Φ(0) = 1/2 for any other y i ∈ Y . The expected number of features with a higher median than y is therefore s(1 − Φ( √ 3)) + t/2, and by similar argument the expected number of features with lower variance is also s(1 − Φ( √ 3)) + t/2. The total number of features is s+t, so both expected p-values of median and variance rankings for y are given by ). We note that y = δ y under our assumptions, so without loss of generality we will work with y for the remainder of the proof. y is a sum of two independent random variables (the fraction of background samples greater than y plus the fraction of other true positives greater than y), so its variance is given by: Var( y ) = 2σ 2 /3 + 4σ 2 /3 = 2σ 2 . Using this fact and Chebyshev's inequality, we can probabilistically bound the distance from y to its expected value: Rewriting the result of Lemma 1 as p(y) = f ( y ) = 2 y (1 − 2 ln( y )), we now want to bound p(y). Note that for y ∈ [0, 1], f ( y ) is Lipschitz with a constant of 1.5: To bound the maximum p-value below 0.05, we therefore want to bound p(y) = f ( y ) ≤ f (E( y )) + 1.5a < 0.05. Setting a = 0.003 and using the assumption s s+t ≤ 0.1, with probability at least Proof. For any background feature x with median m x , P (m x i > m x ) = 1/2 for any other x i , and . The expected number of features with greater median, as well as those with lower variance, than x is therefore s/2 + t(1 − Φ(− √ 3)). Using the same logic as above,

3)). We again use Equation 3 and Equation 4
, but now focus on the case where p(y) = f ( y ) < f (E( y )), to bound p(y) above 0.05.
Using the assumption that t t+s > 0 and setting a = 0.13, These bounds may not be practical because it is unlikely for biological data to be well represented by normal distributions, and in order for the probabilities we give to be high, the variance of the distributions (σ) must be extremely small. The Chebyshev inequality that these proofs depend on is quite weak, giving a loose bound, so more practical limits on the distributions may exist with a tighter bound. These bounds still provide some insight on the outcomes of CFD, notably showing that the conditions for avoiding false positives are much weaker than for guaranteeing discovery of all true positives.

Data
Simulated data was created to test two scenarios: tolerance of poor quality samples, and ability to avoid false positives. For biological data, bulk RNA-seq from two major studies was downloaded in .fastq format. The Illumina Human Body Map data consists of samples from 16 different normal human tissues, with two replicates of each. This is the same data used by Eisenberg and Levanon [8] to identify housekeeping genes previously. GTEx version 7 consists of 9781 samples from 55 different human tissue sites [15]. All bulk RNA-seq data was quantified as TPM values using Salmon version 0.9.1 [16]. We used the R package tximport [22] to combine transcript level quantifications to gene level quantifications, using gene annotations from GENCODE version 26 [11].

Simulation data
In order to test the level of consistency required, or CFD's tolerance to low quality samples, matrices with 10000 features over 1000 samples were generated with 0%, 5%, 10%, 15%, 20%, and 25% of the samples drawn from a uniform distribution, which we will call uninformative samples. In each case, 50 features were drawn from a normal distribution with high mean and low variance (these  On simulation data, CFD proved to tolerate a low percentage of poor quality samples, but its accuracy dropped to zero when 25% of samples were uniformly distributed. We note that in all cases, CFD never reported any false positives so the specificity in each of these experiments was 100%. While CFD maintains high accuracy for small proportions of noisy samples (there is no statistically significant difference in accuracy distributions between 0%, 5%, and 10% under the Kolmogorov-Smirnov two-sample test), the accuracy rapidly decreases when over 10% of samples do not preserve the high median and low variance of the true features, and goes to zero when 25% of samples do not match the conservation pattern (Figure 2a).
CFD is also able to identify cases in which none of the input fits the pattern of consistently high values. Previous methods such as scMerge [14] rank features by some conservation metric and pick the top n% as the conserved features, for some user-specified n. When no input features are truly consistently high, an approach like this will simply result in a list of false positives. In contrast, we find that CFD returns no statistically significant features when all input features are either high mean and high variance, or low mean and low variance, across 14 different parameter settings (Figure 2b)). Therefore in simulation data CFD appears highly robust to false positives.  Housekeeping genes are typically defined as genes required for the maintenance of basic cellular functions, and they are expected to be relatively highly expressed in all cell and tissue types. Robust identification of these genes has proven challenging.

Human Body Map.
We identified 168 genes that passed statistical significance and multiple hypothesis testing using CFD. These genes are all consistently near the top of their respective sample distributions, demonstrating high values and low variance as desired. The top ten genes with lowest p-values are reported in Table 1, and the full list can be found in Supplemental Data. Nine of these genes are mitochondrially encoded genes. Mitochondrial DNA contains genes that are necessary for mitochondrial function [23], therefore it is reasonable to see these genes identified as consistently highly expressed across samples from various tissue types.
Our set of housekeeping genes is enriched for GO terms related to basic cellular functions and processes, across all three GO categories ( Table 2). These results were obtained using gProfiler [17], with the ordered list of genes as input and the background as all human genes. Full GO enrichment results can be found in Supplemental Data.

GTEx: filtering out low quality samples.
Not all of GTEx data is of high quality as noted in previous GTEx studies [5], so we filtered out lower quality samples. With the help of MultiQC [9], we used mapping percentage (percent of reads that could be mapped during quantification) as a proxy for data quality and filtered GTEx by this value. Running CFD on the full GTEx version 7 release (9781 samples) returned no significantly conserved genes. Plotting the p-value distributions for the GTEx data shows that including the low quality data produces an unexpected distribution with unexplained peaks, which persists even when considering data with at least 60% of reads mapped (Figure 3a). This figure also suggests that many genes are very far from satisfying the property of high values across samples, as seen by the large number of genes with very high p-values. More significant genes could be obtained by filtering out the genes with very high p-values, thereby better satisfying the expectation of a uniform distribution of p-values and testing fewer unnecessary hypotheses. To verify that the genes we identified by thresholding the data were due to the higher data quality rather than simply a smaller sample size, we took ten random subsets of 8054 samples from GTEx, and ran CFD on each of them. None of these random subsets returned any significant genes, and all showed an uneven p-value distribution consistent with the full data (Figure 3b).
On the subset of 8054 high quality GTEx samples, 149 genes passed statistical significance and multiple hypothesis testing. This gene list is enriched for similar terms as the set from the Human Body Map, again representing terms fundamental to cellular functioning. The top three terms from each category on each set (Human Body Map and GTEx) are provided with p-values in Table 2. For both the biological process and cellular component categories, the top three terms from Human Body Map and GTEx were not identical, so all terms in the top three of either list are reported. The top ten genes of the GTEx list are the same as the top ten we found from the Human Body Map data, though in a different order (Table 1).

Comparisons to previous work show little agreement in housekeeping gene lists
Housekeeping genes have been identified using many different data types and methods, with generally little agreement between them [7,21,2]. We compared our results with three studies from within the last ten years (Figure 4). Two of these previous studies identified many more housekeeping genes than we did (3804 and 2064), while the third resulted in a list of only 27 genes. Chang et al. [4] computed housekeeping genes as those that are universally expressed in normal tissue, based on microarray samples. Eisenberg and Levanon [8] used the same Human Body Map RNA-seq data used here and defined housekeeping genes as those expressed at a constant level in all tissues. Caracausi et al. [3] searched for genes with high expression values and low standard deviation that were present in a large majority of samples, based on specific cutoff values for each criteria. These definitions all differ somewhat from each other and from the objective of CFD, which looks for genes that are consistently highly expressed across tissues, relative to their sample distributions. For both of our gene lists from GTEx and from the Human Body Map data, we find only 1 gene in common with all three previous lists (RPL8, a ribosomal protein), and about 50 genes in common with both Chang et al. [4] and Eisenberg and Levanon [8] (Figure 4a)  definitions, only approximately 30 genes in our set were not found in any of the previous lists, and half of these were mitochondrially encoded genes, which may not have been considered by previous studies.
Most studies report a short list of their most confident housekeeping genes, and we see little consistency in these lists across methods. Caracausi et al. [3] gave a list of eight genes, and Eisenberg and Levanon [8] reported eleven. Chang et al. [4] did not provide a short list, but gave their full ranking, and we pulled the top ten from this list. Among the eight genes listed as best fitting the criteria of Caracausi et al. [3], four were not in either of our lists, despite the more similar definition of a housekeeping gene (Figure 4b). There is almost no overlap between the three previously published lists of "highly confident" housekeeping genes ( Figure 4c); only two genes are shared between two studies, while the third study has no genes in common with either of the other two. Using CFD on our two data sets, we find a fairly large overlap in the full lists (Figure 4d), and as previously noted the top ten genes from both Human Body Map and GTEx are identical, suggesting that CFD returns fairly consistent results. Taken together, these results highlight the level of uncertainty and importance of methods in identifying housekeeping genes.

Discussion and Conclusions
We have introduced a general statistical method called CFD that identifies features with consistently high values across input conditions, proved guarantees on its specificity and sensitivity, and demonstrated its effectiveness through simulated data and by identifying human housekeeping genes on two bulk RNA-seq data sets. CFD requires no parameters and makes no assumptions about the underlying distributions of or relationships between input samples. Simulated data suggests that CFD requires consistency across at least 80% of samples to identify any meaningful features, and has very high specificity. The housekeeping genes that we identify are consistent between two very different RNA-seq data sets, and gene annotations suggest that the genes we identify are involved in fundamental cellular processes, as we would expect for housekeeping genes. It is likely that more than the 149 or 168 genes identified by CFD satisfy the definition of housekeeping genes. The relatively low numbers reported here may be due to the very large number of genes that are either variable across tissue types or have low expression values, as shown by the large numbers of high p-values ( Figure 3). If desired, more careful filtration of these genes that clearly do not fit the desired properties would likely yield more genes passing the multiple hypothesis testing procedure, and a larger list of housekeeping genes. The application to GTEx data in particular showed that CFD can sometimes benefit from some preprocessing or filtration steps to ensure the input data is not obscured by poor quality samples.
The framework of CFD can be easily adapted to identify any combination of high or low median and high or low variance features, simply by changing the direction of the inequality in ψ. In other applications, measurements with low values or high variance might be more of interest, and our statistical framework could be adapted in a straightforward manner to identify such features. In yet other applications it may be desirable to weight the relative importance of median and variance, which could be done by switching the p-value combination from Fisher's method to Stouffer's Z-score method, in which weights are straightforward to introduce. This general statistical method represents a step towards principled analyses of conserved realvalued features across multiple conditions, and its framework can be easily adapted for different objectives. CFD could be applied to any data in which the same features are measured under different conditions, including gene expression, ChIP-seq, and protein quantifications.

Availability
Code for CFD and scripts to reproduce the figures and analysis in this work, along with Supplemental Data files, are available at https://github.com/Kingsford-Group/cfd.

Funding
This work has been supported in part by the Gordon and Betty Moore Foundation's Data-Driven Discovery Initiative through Grant GBMF4554 to C.K., and by the US National Institutes of Health (R01HG007104 and R01GM122935). Research reported in this publication was supported by the NIGMS of the NIH under award number P41GM103712. This work was partially funded by The Shurl and Kay Curci Foundation.