Correcting for Dependent P-values in Rhythm Detection

There is much interest in using genome-wide expression time series to identify circadian genes. Several methods have been developed to test for rhythmicity in sparsely sampled time series typical of such measurements. Because these methods are statistical in nature, they rely on estimating the probabilities that patterns arise by chance (i.e., p-values). Here we show that leading methods implicitly make inappropriate assumptions of independence when estimating p-values. We show how to correct for the dependence to obtain accurate estimates for statistical significance during rhythm detection.


Introduction
Molecular rhythms are now commonly identified by statistically analyzing whole-genome expression time series measurements. To this end, a number of rhythm detection algorithms have been introduced [14,6,15,13,3,16,5,8]. Although different algorithms are sensitive to different data features, they all seek to estimate the probabilities that perceived patterns arise by chance from experimental uncertainties (i.e., p-values). Because the p-values are often compared with desired significance thresholds to determine which genes are considered rhythmic, accurate p-value estimates are important.

Comparisons of multiple reference waveforms to a single time series
are not independent JTK CYCLE [5], RAIN [13], and eJTK [6] all use reference waveforms with different phases (and in the case of RAIN and eJTK, different peak to trough and trough to peak times) to find the reference waveform that best matches the experimental waveform according to a test statistic (Kendall's τ for JTK CYCLE and eJTK, Mann-Whitney U for RAIN). The three methods, however, take different approaches to obtaining a single p-value for the time series from the many p-values obtained for each reference waveform comparison. JTK CYCLE applies the Bonferroni correction, which was developed to control the Family-Wide Error Rate: the probability of obtaining one or more false positive results. The Bonferroni correction multiplies the best-matching p-value by the number of comparisons to produce an adjusted p-value that is too large, as described in Hutchison et al. 2015 [6].
RAIN instead uses the Benjamini-Hochberg method [1] (also known as the q-value approach [12]), which was developed to control the False Discovery Rate: the proportion of positive results that are false. As shown in the context of JTK CYCLE in Hutchison et al. 2015, [6], the Benjamini-Hochberg method also results in inaccurate p-values. The reason is that the Benjamini-Hochberg method assumes that the input p-values are independent. However, the p-values from different reference waveforms are correlated since they all concern the same experimental time series. Running RAIN on simulated data generated from independent draws from the normal distribution (i.e., under the null hypothesis) results in p-values that are not uniformly distributed (Fig. 1A).
For the reasons discussed in the Introduction, for N time series generated under the null distribution, the test statistic with i test statistics higher than it should have p-value Consequently, we can compare the output p-values to empirical p-value estimates (i.e., Eq. 1) by calculating the ratio r = output p empirical p . The results are shown in Fig. 1B. Ideally, a method should yield a horizontal line at r = 1 (the dashed line in Figure 1B). Instead, consistent with the left peak in Fig. 1A, low p-values are smaller than they should be (r < 1), and, consistent with the right peak in Fig. 1A, high p-values are larger than they should be (r > 1). The former case (r < 1 for low p-values) tends to produce false positives-rhythmic patterns are accorded more significance than they should be. A solution to this problem is to use the empirical p-values to judge the significance of RAIN results.

P-values produced by MetaCycle are not uniform under the null distribution
Given that there are a variety of rhythm detection methods available that focus on different aspects of experimental time series, it seems like it should be advantageous to combine methods. Recently, Wu et al. [14], implemented this idea by using Fisher integration [4] to combine the p-values from JTK CYCLE [5], Lomb-Scargle [9,11], and ARSER [16] in a method called MetaCycle. In this section, we analyze this method and its component algorithms from the perspective of the previous section. In the following section, we consider the Fisher integration explicitly.
We generated 1000 time series of 12 time points spaced at 2-hour intervals by drawing values from the normal distribution. We did not include replicates for time points because ARSER is only employed in MetaCycle when replicates are not available, and we wanted to test the three-algorithm combination. Consistent with Hutchison et al. 2015 [6], the p-values produced by JTK CYCLE were too large. The p-values from LS were also too large ( Fig. 2A), but those from ARSER were reasonably accurate (r ≈ 1). However, the p-values for the combination (Meta2D) are non-uniformly distributed, with underestimates of the p-values for low values and overestimates for high values ( Fig. 2A).
To make these data more accurate, we took an analogous approach to above and to Hutchison et al 2015 [6]. We ran MetaCycle on 1,000,000 randomly generated time series and used those results   To correct for the correlation, we adopt the Brown method, a modification of the Fisher method for one-sided tests of significance [2]. The Brown method uses the covariance between the different p-value sets to adjust the combined test statistic (χ 2 ) distribution and produce adjusted p-values. In the case where the p-values are independent, it provides the same results as the Fisher method (Fig. 3A). When the p-values are not independent, however, the Brown method produces accurate integrated p-values with r ≈ 1 (Fig. 3B), indicating that they are uniformly distributed from 0 to 1.

Properly combining p-values does not appear to result in improved rhythm detection
Having discussed how to adjust the methods from MetaCycle to produce accurate p-values and how to use Brown's method for accurate integration, we now explore the effects of the Fisher method and the Brown method on combining different methods. As noted above, the former leads to problems when the p-values produced by different methods are correlated. For a dataset of 12 time points drawn from a Gaussian distribution, the R 2 between empirically-corrected pvalues produced by JTK CYCLE and Lomb-Scargle is R 2 = 0.388, that between JTK CYCLE and ARSER is R 2 = 0.392, and that between Lomb-Scargle and ARSER is R 2 = 0.999. As a result, pairwise Fisher Integration yields p-values roughly 1/100-th to 1/1000-th of the values that they should be (from ordering the combined test statistics), and three-way Fisher Integration yields values that are 1/10,000-th the values that they should be (Fig. 4A). By contrast, when using the Brown method, r ≈ 1 in all cases (Fig. 4B).
Having corrected the p-value integration, we compared the combinations of the methods to the individual methods for rhythm detection. We generated 396 rhythmic time series by adding Gaussian noise plus 10 (to avoid negative values) to a cosine with 24 points sampled evenly across 2 periods (analogous to sampling every 2 hours for 48 hours across 2 24-hour periods); the standard deviation of the Gaussian noise was equal to the amplitude of the cosine. For these time series, the R 2 between empirically-corrected p-values produced by JTK CYCLE and Lomb-Scargle is R 2 = 0.468, that between JTK CYCLE and ARSER is R 2 = 0.297, and that between Lomb-Scargle and ARSER is R 2 = 0.431. To analyze the classification strength of the different methods, we combined the randomly generated data with the cosine-generated data, yielding a dataset with ≈ 28% rhythmic time series, which is in line with tissue-specific percent rhythmicity estimates [7]. During rhythm detection, it is important to account for correlations in p-values that arise from the application of multiple tests to the same experimental time series. Here, we show that this issue impacts the correction for testing across reference waveforms in JTK CYCLE [5] and RAIN [13], and it impacts the integration of p-values from multiple methods in MetaCycle [14]. While it can be computationally costly to correct the p-values empirically [6,7], Brown's correction of Fisher integration involves little additional effort [2]. Surprisingly, we find no advantage to combining methods with accurate p-values. This may be due to the fact that most rhythm detection methods overlap in the aspects of the time series that they examine. This high level of correlation results in a type of multiple hypothesis comparison problem that increases the p-values without likewise increasing the sensitivity of the combined methods to detect rhythmicity. For method combination to be effective, new methods of rhythm detection would need to be developed that approach rhythm detection from new directions.

Acknowledgments
This work was completed in part with resources provided by the University of Chicago Research Computing Center.