Post-selection Inference Following Aggregate Level Hypothesis Testing in Large Scale Genomic Data

In many genomic applications, hypothesis tests are performed by aggregating test-statistics across units within naturally defined classes for powerful identification of signals. Following class-level testing, it is naturally of interest to identify the lower-level units that contain true signals. Testing the individual units within a class without taking into account the fact that the class was selected using an aggregate-level test-statistic will produce biased inference. We develop a hypothesis testing framework that guarantees control of false positive rates conditional on the fact that the class was selected. Specifically, we develop procedures for calculating unit-level p-values that allow rejection of null hypotheses controlling for two types of conditional error rates, one relating to the family-wise error rate and the other to the false discovery rate. We use simulation studies to illustrate the validity and power of the proposed procedures in comparison to several possible alternatives. We illustrate the power of the method in a natural application involving whole-genome expression quantitative trait loci (eQTL) analysis across 17 tissue types using data from The Cancer Genome Atlas (TCGA) Project.


Introduction
In large scale analysis of genomic and genetic data, it is common to perform tests for hypotheses at an aggregate level over "classes" of multiple related units. In analysis of genome-wide association studies, for example, testing for genetic associations may be performed at a gene-, region-, or pathway-level by aggregating association statistics over multiple genetic markers (Yu et al., 2009; Wu et al., 2010a,b). Similarly, for identification of susceptibility markers with pleiotropic effects, association tests for each individual genetic marker may be performed by aggregating association test-statistics over multiple related phenotypes (Bhattacharjee et al., 2012; Peterson et al., 2015).
Recently, in microbiome-profiling studies, association tests for identifying taxa were performed by aggregating association test-statistics over multiple traits (Hua et al., 2016). When multiple signals are expected within classes of related units, use of suitable aggregate-level test-statistics can improve the power of discovery of the signal harboring classes, compared to one unit at a time analysis.
In using aggregate level test-statistics for large scale hypothesis testing, much research has focused on the development of test-statistics that can combine evidence of signals from multiple units in the most powerful way. For any given test-statistic, standard multiple testing adjustment methods are typically applied to adjust for the number of classes, which could be potentially large, over which the analyses may be performed. An important area of research that has received less attention involves how to perform proper post-selection inference following the aggregate level hypothesis testing to isolate individual units that contain signal within the higher-level classes. For example, if a genetic marker is identified to be statistically significant for aggregate-level association in pleiotropic analysis across multiple phenotypes, it is then very natural to perform follow-up analysis to identify which underlying phenotypes contain true signals for associations.
Clearly, a naive approach that ignores the effect of class selection on the second stage of individual-unit selection will produce biased inference, even if multiple testing adjustment methods are used to adjust for the number of units within each selected class.

In this manuscript, we study the problem of post-selection inference following aggregate level analysis as a two-stage hypothesis testing problem. The data of interest, the unit-level test-statistics, can be organized in an m × n matrix, where m is the number of classes for which inference is of interest, and n is the number of units within each class. For example, in GWAS of different phenotypes the data matrix for analysis has rows indexed by the SNPs, and columns indexed by the phenotypes.
The first goal is to select the rows (SNPs) that have signal in at least one column (phenotype), and the post-selection goal can be to identify which columns have signal within each of the chosen rows.
Approaches to various error rate controls have been proposed in the past for multi-stage or hierarchical hypothesis testing. Possibly most relevant of them in the context of the applications described above is a procedure proposed by Benjamini and Bogomolov (2014), which controls false positives on the average over the selected rows. The method was recently applied to pleiotropic analysis of GWAS and single tissue eQTL data (Peterson et al., 2015, 2016), a type of application that we also use for illustration in this manuscript. Barber and Ramdas (2015) suggested controlling the FDR at an aggregate level of rows or columns, in addition to controlling the FDR across all individual hypotheses. Liu et al. (2016) suggested controlling a posterior measure of false discoveries within each selected row, in addition to overall. Yekutieli (2008) suggested controlling different FDR-types on trees of hypotheses, assuming independence between the stages of the hierarchy. Some other strategies have been tailored to special application fields, such as gene expression analyses (Li and Ghosh, 2014), electroencephalography research (Singh and Phillips, 2010), and functional magnetic resonance imaging (Schildknecht et al., 2015). Conditioning on the selection event has been suggested in novel works on post-model selection (Fithian et al., 2015; Lee et al., 2016), as well as for spatial signals in Benjamini and Heller (2007).
For the selection of columns in the second stage of the above hypothesis testing framework, we propose multiple testing procedures that guarantee control of false positive rates conditional on the fact that the row is selected. As a researcher may often conduct different follow-up experiments for each selected row, control of false positives at the level of each row may be more desirable than controlling average rates over all selected rows, as has been considered for this problem by Peterson et al. (2015). It has been noted earlier (Benjamini and Bogomolov, 2014) that development of a general procedure for controlling such error rates could be difficult. In the setting where the test-statistics can be assumed to be independent across columns, we propose general procedures that are theoretically shown to control the family-wise error rate and the false discovery rate at the level of each row conditional on selection. We then show through simulation studies that for a commonly used aggregation test-statistic, the proposed procedure not only provides stronger type-I error rate control but also has a major power advantage when compared to the general procedure proposed by Benjamini and Bogomolov (2014). An application of the methods for analysis of SNP markers predictive of gene expression levels across multiple tissues, using data from The Cancer Genome Atlas (TCGA) project, provided further empirical validation of the substantially higher power of the proposed method compared to alternatives. These results provide an encouraging direction for further research in developing more powerful procedures for two-stage hypothesis testing in more general frameworks.
In Section 2, we set up the proposed hypothesis testing framework more formally and define the criteria of type-I error that we propose to control. In Section 3, we present a procedure for controlling the proposed type-I error criteria and provide theoretical results supporting its validity.
In Section 4, we conduct simulation studies to evaluate the type-I error rates and power of the proposed method compared to a naive approach and to the procedure proposed by Benjamini and Bogomolov (2014). In Section 5, we describe results from the analysis of the TCGA study. In Section 6, we conclude with a discussion of future directions.

Notation and Goal
Let H_ij, j = 1, …, n, be the family of n null hypotheses for the ith row, and let P_ij, j = 1, …, n, be their corresponding p-values, for i = 1, …, m. In the first stage, a global null p-value p_iG is computed for each row by aggregating the row's unit-level statistics, and a subset S ⊆ {1, …, m} of rows is selected based on these global null p-values. The second stage analysis is based on the individual p-values {p_ij : j = 1, …, n, i ∈ S}, to answer questions such as which columns within selected rows contain signal.
Let V_i and R_i be the number of false and total rejections for row i, respectively. Our first error measure, which henceforth is referred to as the conditional FWER for a selected row, is the probability of at least one erroneous rejection within the row, conditional on it being selected, Pr(V_i ≥ 1 | i ∈ S). Our second error measure, which henceforth is referred to as the conditional FDR for a selected row, is the expected fraction of erroneous rejections among the rejections within the row, conditional on it being selected, E[V_i / max(R_i, 1) | i ∈ S]. If n = 1, our error measures coincide with the selective type I error rate, i.e., the error rate of a test given that it was performed, as suggested by Fithian et al. (2015).
Our goal is to suggest valid multiple testing procedures for controlling the conditional FWER/FDR for the selected rows.

Valid inference within a selected row
In this section we provide procedures for controlling the conditional FWER/FDR for a selected row. The valid inference we provide relies on the assumption that the columns are independent, as is typical in meta-analyses. However, across rows (i.e., within columns), the unit-level test-statistics (or p-values) may be dependent.
The case where a row is selected if the aggregate test passes a predefined threshold is addressed in § 3.1. We provide procedures that control the conditional FWER/FDR for any type of dependency across rows. This type of selection is standard practice in GWAS meta-analysis, see § 5 for a rich data application.
The case where a row is selected based on a data-dependent threshold is addressed in § 3.2. We provide procedures that for independence across rows control the conditional FWER/FDR, as well as the average error rate (Benjamini and Bogomolov , 2014). We discuss conditional and average error rate control using our approach when there is dependency across rows.
3.1 Inference following selection of rows using a fixed cut-off on the aggregate level test-statistics

Even in the simplest setting where the row selection is based on a fixed cut-off, adjusting for selection is not obvious, since the probability of the selection event depends on the unknown distribution of the non-null p-values within the row. In § 3.1.1 we suggest a way around this problem, which leads to valid conditional p-values. Although the original p-values were independent, the conditional p-values in the selected row are dependent due to the selection event. In § 3.1.2 we present multiple testing procedures based on these conditional p-values that control the conditional error rate.

The conditional p-value computation
For each row i ∈ {1, …, m}, the p-values corresponding to the different columns are assumed independent. The computation of p_iG can be based on different aggregates of the p-values p_ij, j = 1, …, n; see Loughin (2004) for common choices and a review. A popular choice is Fisher's combining method, which has been shown to have excellent power properties (see, e.g., Owen (2009) and the references within). Let f : [0, 1]^n → ℝ be a combining function for testing the global null on each row. For example, Fisher's combining function is f(p_i1, …, p_in) = −2 ∑_{j=1}^n log p_ij. Under the global null of no signal in the row, if the p-values in the row are uniformly distributed, −2 ∑_{j=1}^n log P_ij has a chi-squared distribution with 2n degrees of freedom, −2 ∑_{j=1}^n log P_ij ∼ χ²_{2n}, so

p_iG = Pr(χ²_{2n} ≥ −2 ∑_{j=1}^n log p_ij).  (3.1)

See Remark 3.1 and Section 5 for other combining functions.
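As a concrete illustration, the global null p-value in (3.1) can be computed as in the following sketch (the function name is ours):

```python
import numpy as np
from scipy import stats

def fisher_global_pvalue(pvals):
    """Global null p-value for one row via Fisher's combining function.

    Under the global null, -2 * sum(log p_ij) ~ chi-squared with 2n df,
    so the global p-value is the upper tail probability of that statistic.
    """
    pvals = np.asarray(pvals, dtype=float)
    stat = -2.0 * np.log(pvals).sum()
    return stats.chi2.sf(stat, df=2 * len(pvals))
```

For n = 1 the global p-value reduces to the single p-value itself, which is a quick sanity check on the implementation.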
Let t be the fixed row-selection cut-off, so that row i is selected if f(p_i1, …, p_in) ≥ t. We require that the combining function be non-increasing in each of its arguments when the other n − 1 values are held fixed. This is a reasonable requirement, since if the individual p-values decrease, the selection of the row should become easier: if f(p_i1, …, p_in) ≥ t and p′_ij ≤ p_ij for j = 1, …, n, then f(p′_i1, …, p′_in) ≥ t. For simplicity, we shall assume that the p-values are continuous, with a uniform null distribution, and therefore that f is continuous.
Since we assume we are dealing with a row selected by the global test, we omit the row index i from here on to simplify the notation. The n independent p-values for a selected row are therefore denoted by P_1, …, P_n, and their observed values by p_1, …, p_n. We have observed that the global null was rejected, which means that the p-values satisfy f(p_1, …, p_n) ≥ t.
We let b_j satisfy

f(p_1, …, p_{j−1}, b_j, p_{j+1}, …, p_n) = t  (3.2)

if a solution b_j ∈ (0, 1) exists, and b_j = 1 otherwise. The aim is to provide a test for each of the n hypotheses having observed that the global null was rejected. To this end, we adjust the p-values to

p̃_j = p_j / b_j, j = 1, …, n.  (3.3)

These are valid conditional p-values. To see this, note that P_j, conditional on the row being selected and on the remaining p-values, is truncated at b_j ∈ (0, 1]. So if unconditionally P_j is uniformly distributed, P_j ∼ U(0, 1), then conditionally on being selected and on the remaining p-values it has a uniform distribution between zero and b_j:

P_j | f(p_1, …, p_{j−1}, P_j, p_{j+1}, …, p_n) ≥ t, P_1 = p_1, …, P_{j−1} = p_{j−1}, P_{j+1} = p_{j+1}, …, P_n = p_n ∼ U(0, b_j).
For Fisher's combining function, f(p_1, …, p_n) ≥ t is equivalent to ∏_{j=1}^n p_j ≤ e^{−t/2}. Therefore, b_j = min{e^{−t/2} / ∏_{l≠j} p_l, 1}, and the conditional p-values are:

p̃_j = p_j / min{e^{−t/2} / ∏_{l≠j} p_l, 1}, j = 1, …, n.  (3.4)

The magnitude of the inflation in the post-selection p-values, p̃_j − p_j, depends on how the signal is distributed in the row. Using Fisher's combining method, there can be no inflation (i.e., no cost for selection!) if the signal is strong in at least two columns. This is a direct result of the conditional p-value computation (3.4), where p̃_j = p_j if ∏_{l≠j} p_l ≤ e^{−t/2}. Note that even though it is enough to have one strong signal for the selection of the row, at least two columns need to contain strong signals for all n conditional p-values to coincide with the original p-values. Therefore, computing the conditional p-values as suggested in (3.3) will typically lead to smaller conditional p-values, whenever signals are present in more than one column, than the use of the following probabilities, which are computed under the global null:

Pr(P_j ≤ p_j | f(P_1, …, P_n) ≥ t), j = 1, …, n.  (3.5)

The computation in (3.5) conservatively uses the uniform distribution for all the p-values, P_k ∼ U(0, 1) for k = 1, …, n, and the resulting probabilities can be substantially larger than the original p-values: for Fisher's combining method, if t = χ²_{2n,1−α/m} and p_j ≤ e^{−t/2}, the value will be (m/α) p_j.
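A minimal sketch of the computation in (3.4) for Fisher's rule, with the capping of b_j at 1 made explicit (the function name is ours):

```python
import numpy as np

def fisher_conditional_pvalues(pvals, t):
    """Conditional p-values within a row selected because
    -2 * sum(log p_j) >= t, i.e. prod(p_j) <= exp(-t/2) (Fisher's rule).

    b_j = min(exp(-t/2) / prod_{l != j} p_l, 1);  p_tilde_j = p_j / b_j.
    """
    p = np.asarray(pvals, dtype=float)
    # log of prod_{l != j} p_l, computed stably on the log scale
    log_prod_rest = np.log(p).sum() - np.log(p)
    b = np.minimum(np.exp(-t / 2.0 - log_prod_rest), 1.0)
    return p / b
```

Note how the "no cost for selection" phenomenon appears: when two or more p-values are small enough that each leave-one-out product is below e^{−t/2}, all b_j = 1 and the conditional p-values equal the original ones.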
If the row is selected when ∑_{j=1}^n z_j / √n > t (Stouffer's rule, with one-sided p-values p_j = 1 − Φ(z_j)), then the conditional p-values are:

p̃_j = p_j / {1 − Φ(√n t − ∑_{l≠j} z_l)}, j = 1, …, n.

The computation of the conditional p-values is more complex using the test of Bhattacharjee et al. (2012) for the global null, called ASSET. Their test is based on identifying the subset S_max that has the largest weighted Stouffer test statistic, and the significance at the aggregate row level is based on the maximal test statistic, ∑_{j∈S_max} w_j Z_j / √|S_max|. Bhattacharjee et al. (2012) showed that their global test can be more powerful than a test statistic that aggregates all the information in the row without subset selection.
Remark 3.2 (The ranking of the original and conditional p-values). Using Fisher's or Stouffer's combining method, it is easy to show that the ranking of the p-values remains unchanged, i.e., if p_1 ≤ … ≤ p_n then p̃_1 ≤ … ≤ p̃_n. Other combining methods may lack this desired property.
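A sketch of the analogous computation for Stouffer's rule, which also lets one check the rank preservation noted in Remark 3.2 (the function name is ours; selection truncates each P_j at b_j = 1 − Φ(√n t − ∑_{l≠j} z_l)):

```python
import numpy as np
from scipy import stats

def stouffer_conditional_pvalues(z, t):
    """Conditional p-values when the row is selected because
    sum(z_j) / sqrt(n) > t, with one-sided p_j = 1 - Phi(z_j).
    """
    z = np.asarray(z, dtype=float)
    n = len(z)
    p = stats.norm.sf(z)                            # original p-values
    b = stats.norm.sf(np.sqrt(n) * t - (z.sum() - z))  # truncation points
    return p / b
```

Given selection, z_j > √n t − ∑_{l≠j} z_l, so p_j < b_j and the conditional p-values stay in (0, 1).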

Valid procedures for conditional FWER/FDR control
Applying a valid FWER/FDR controlling procedure to {p̃_j : j = 1, …, n} will guarantee control of the conditional FWER/FDR for a selected row. The choice of FWER/FDR controlling procedure should be guided by the observation that the n conditional p-values in the row are dependent, despite the fact that the original p-values were independent across columns. The conditional FWER for a selected row will be controlled if we use Bonferroni-Holm on the conditional p-values within each selected row, since the Bonferroni-Holm procedure guarantees FWER control under any dependency between the p-values. For conditional FDR control, we would like to use the Benjamini-Hochberg (BH) procedure (Benjamini and Hochberg, 1995) within each selected row at level α. The next theorem shows that for the dependency among the p-values induced by the conditioning step, the BH procedure is indeed valid.
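The two within-row procedures can be sketched as follows; these are textbook implementations of Bonferroni-Holm and BH, to be applied to the conditional p-values p̃_1, …, p̃_n (function names are ours):

```python
import numpy as np

def holm_reject(pvals, alpha):
    """Holm step-down: FWER <= alpha under any dependence."""
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)
    reject = np.zeros(n, dtype=bool)
    for k, j in enumerate(order):       # compare p_(k+1) to alpha / (n - k)
        if p[j] <= alpha / (n - k):
            reject[j] = True
        else:
            break                       # step-down: stop at first failure
    return reject

def bh_reject(pvals, alpha):
    """Benjamini-Hochberg step-up at level alpha."""
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, n + 1) / n
    reject = np.zeros(n, dtype=bool)
    if below.any():
        kmax = np.max(np.nonzero(below)[0])   # largest k with p_(k) <= alpha*k/n
        reject[order[: kmax + 1]] = True
    return reject
```

As expected, BH rejects at least as much as Holm on the same p-values, since it is a step-up FDR procedure rather than a step-down FWER procedure.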
Theorem 3.1. Let P_1, …, P_n be independent p-values, each with a uniform null distribution. If f(p_1, …, p_n) ≥ t, then the BH procedure at level α on p̃_1, …, p̃_n controls the conditional FDR at level ≤ (n_0 / n) α, where n_0 is the number of true null hypotheses.
Proof. Without loss of generality, relabel the columns so that the first n_0 columns have a true null hypothesis. We make use of the representation of the FDR from Benjamini and Yekutieli (2001), so the conditional FDR is n_0 E[I / max(R, 1) | f(P_1, …, P_n) ≥ t], where I = 1 if H_1 (the first hypothesis), which we assume to be null, is rejected, and R is the number of rejected hypotheses. We condition on p_2, …, p_n, so that it is sufficient to show that

Q = E[I / max(R, 1) | P_2 = p_2, …, P_n = p_n, f(P_1, …, P_n) ≥ t] ≤ α/n.  (3.7)

The only random quantity in Q is P̃_1 = P_1/b_1, and it is uniformly distributed between 0 and 1 given the selection event. As p_1 increases, b_2, …, b_n will be non-increasing, so there must be constants 0 = a_0 < a_1 < ⋯ partitioning (0, 1] into intervals on which the number of rejections is constant and non-increasing in p̃_1; the bound (3.7) then follows from the step-up structure of the BH procedure. Interestingly, for Fisher's selection rule the conditional FDR is exactly n_0 α / n.
Corollary 3.1. Let P_1, …, P_n be independent p-values, each with a uniform null distribution. If f(p_1, …, p_n) = −2 ∑_{j=1}^n log p_j ≥ t, then the conditional FDR of the BH procedure at level α on p̃_1, …, p̃_n is equal to (n_0 / n) α.
See Appendix A for the proof. The conditional FDR level of the BH procedure is thus at most π_0 α, where π_0 = n_0/n. When n is large enough (say 30 or more), it may be useful to estimate π_0 and incorporate the estimate in the multiple testing procedure to gain power. The gain in power can be substantial if π_0 is far less than one, as appears to be the case in the application considered in Section 5. Schweder and Spjotvoll (1982) proposed estimating π_0 by #{p-values > λ} / (n(1 − λ)), where λ ∈ (0, 1). This estimator has been incorporated into multiple testing procedures in recent years. For independent p-values, Storey (2003) showed that the BH procedure at level α/π̂_0 controls the FDR at level α. Benjamini et al. (2006) suggested another estimator, and noted that the BH procedure which incorporates the plug-in estimator with λ = 0.5 is sensitive to deviations from the assumption of independence, and its FDR level may be inflated above the nominal level under dependency. Blanchard and Roquain (2009) provided extensive simulations showing that the BH procedure which incorporates the plug-in estimator with λ = α is robust to deviations from the assumption of independence, at the price of being slightly more conservative.
We need to investigate which estimator will result in an adaptive BH procedure that controls the conditional FDR. Also, a confidence statement about π 0 within selected rows may be of interest in itself, in providing an upper bound on the fraction of nulls (or a lower bound on the fraction of columns with signal) within each selected row.
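A sketch of the Schweder-Spjotvoll estimator discussed above (the function name is ours; the truncation at 1 is our choice, since the raw ratio can exceed 1):

```python
import numpy as np

def schweder_spjotvoll_pi0(pvals, lam=0.5):
    """Estimate the fraction of true nulls within a row:
    pi0_hat = #{p-values > lam} / (n * (1 - lam)), truncated at 1.

    Rationale: null p-values are uniform, so about n0 * (1 - lam)
    of them are expected to exceed lam.
    """
    p = np.asarray(pvals, dtype=float)
    return min(1.0, float((p > lam).sum()) / (len(p) * (1.0 - lam)))
```

An adaptive BH procedure would then run BH at level α/π̂_0 instead of α; whether this remains valid on the conditional p-values is exactly the open question raised above.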

Inference following data adaptive row-selection rules
In § 3.2.1 we show that our theoretical results carry over to more general row selection rules when the test statistics within each column are independent. In § 3.2.2 we show the connection with the average error rate (Benjamini and Bogomolov , 2014). In § 3.2.3 we discuss conditional and average error rate control using our approach when the test statistics within each column are dependent.

Valid inference under row independence
The results in § 3.1.2 hold if t_i does not depend on P_i1, …, P_in. The independence between the threshold and the ith row p-values is clearly satisfied in the setting of § 3.1, where the threshold is fixed given m and n (or, more generally, given the number of non-missing observations in row i, say n_i). For example, for GWAS, the threshold is often chosen so that the FWER on the m global tests is controlled at the desired nominal level α. If rows are selected by Bonferroni, then the selection threshold for Fisher's combining method is t = χ²_{2n,1−α/m}, i.e., the 1 − α/m quantile of the χ²_{2n} distribution.
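For example, the fixed Fisher/Bonferroni cut-off can be computed as follows (the function name is ours):

```python
from scipy import stats

def fisher_selection_threshold(m, n, alpha=0.05):
    """Fixed selection cut-off for Fisher's combining statistic that
    controls the FWER over the m global tests by Bonferroni:
    t = (1 - alpha/m) quantile of the chi-squared distribution with 2n df.
    """
    return stats.chi2.ppf(1.0 - alpha / m, df=2 * n)
```

By construction, a null row (all p-values uniform) crosses this cut-off with probability α/m, so the chance of selecting any null row is at most α.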
If the p-values across rows are independent, more general data-adaptive thresholds can also lead to valid inference within selected rows. This is clearly so if the threshold for selection depends only on the p-values in the other rows. The class of simple selection rules, which includes row-selection by the BH procedure or by Bonferroni-Holm on the global null p-values, also leads to valid conditional inference within selected rows.
Definition 3.1 (Definition 1 in Benjamini and Bogomolov (2014)). A selection rule is called simple if for each selected row, when the p-values not belonging to that row are fixed and the p-values in that row can change as long as the row is selected, the number of selected rows remains unchanged.
Theorem 3.2. Assume that rows are selected by a simple selection rule such that row i is selected if f(p_i1, …, p_in) ≥ t(|S|). If the p-values across rows are independent, and we compute the conditional p-values for selected row i as in (3.3) with t = t(|S|), then the results of § 3.1.2 continue to hold: Bonferroni-Holm and BH on the conditional p-values control the conditional FWER and FDR, respectively, for each selected row.

See Appendix B for a proof. Let C_i be the (unobserved) random variable whose conditional expectation is the desired error rate, so C_i = 1{V_i ≥ 1} for FWER control and C_i = V_i / max(R_i, 1) for FDR control. It follows from Theorem 3.2 that the conditional error is controlled, since E(C_i | i ∈ S) ≤ α for every i ∈ S.

3.2.2 Relation to the approach of Benjamini and Bogomolov (2014)

Benjamini and Bogomolov (2014) considered selective inference on families of hypotheses, which are defined by rows in our paper. They showed that applying a Bonferroni procedure in each selected row may result in a highly inflated conditional FWER when the selection is based on within-row p-values. Moreover, they noted that the goal of conditional control for any combination of selection rule and testing procedure and for any configuration of true null hypotheses is difficult to achieve.
Indeed, our procedures for conditional control rely on the ability to compute the conditional p-values, which requires well-defined selection rules and independence across the p-values within the rows. Benjamini and Bogomolov (2014) considered a different error measure addressing selective inference, which can be controlled under more general conditions. Specifically, they suggested to control an expected average error measure over the selected rows. While they considered a general class of error rates, we focus on two special cases: the control of the FWER on the average,

E[ ∑_{i∈S} 1{V_i ≥ 1} / max(|S|, 1) ] ≤ α,

and the control of the FDR on the average,

E[ ∑_{i∈S} V_i / max(R_i, 1) / max(|S|, 1) ] ≤ α.
Controlling the FWER/FDR on the average is useful when we require control over false positives on the combined set of discoveries across selected rows. When we require control over false positives within each selected row separately, the conditional FWER/FDR is more appropriate.
Conditional error control can guarantee average error control, as stated in the next theorem.
Theorem 3.3. If the conditional FWER (FDR) is controlled at level α for each selected row, then the FWER (FDR) on the average over the selected rows is controlled at level α.
See Appendix C for the proof, which is straightforward. We know the condition is satisfied if the global null p-values are independent and the selection rule is simple, from Theorem 3.2. Benjamini and Bogomolov (2014) showed that by applying an FWER/FDR controlling procedure within each row at level |S|α/m, if the set S is selected by a simple selection rule, then the FWER/FDR on the average is controlled at level α. However, the conditional FWER/FDR for a selected row may exceed α in some of the rows, as our simulations show. Comparing the conditional approach to the approach of Benjamini and Bogomolov (2014), we can generally say that the conditional approach is most sensitive to the number of non-null columns in the row, whereas the approach of Benjamini and Bogomolov (2014) is most sensitive to the fraction of selected rows, |S|/m. Our simulations and real data example demonstrate that for a small fraction of selected rows, |S|/m, the conditional approach is expected to make more discoveries than the approach of Benjamini and Bogomolov (2014). However, for sparse signal within selected rows (signal present in only two columns or fewer), the approach of Benjamini and Bogomolov (2014) may make more discoveries.

Error control following selection under row dependence
So far, we assumed that within each study, the test statistics are independent and hence the global null p-values are independent. We now consider the case that the studies (columns) are independent, but within each study (column), the set of all p-values may be dependent. This setting was considered in Benjamini and Bogomolov (2014), and although their procedures do not guarantee the control of the error rate on the average for general dependency, they prove that the error rate on the average is controlled if the p-values are positive regression dependent on the subset (PRDS) of true null hypotheses (Benjamini and Yekutieli , 2001).
With dependence, the global null p-values are no longer independent. Clearly, dependence across rows has no effect on the conditional error guarantees of Bonferroni-Holm or the BH procedure on the conditional p-values when the selection threshold is fixed. With a simple selection rule such as BH on the global null p-values, the conditional error may not be maintained. In § 4 we show empirically that for dependent column p-values, the conditional error is controlled, as expected, when the selection rule is fixed, but it may be inflated when the BH procedure is used for selection. However, the average error rate remained below the nominal level even with BH row selection, and it was highest when the signal was strongest.
In the Supplementary Material (SM) we provide theoretical results for the case where the p-values are independent within rows, but PRDS in each column across the rows. We show that if all m rows are null rows with only true null hypotheses, the conditional approach, when BH is used both for the global hypotheses at level q and within each row at level α, controls the average FDR at level ≤ α × q ≤ α.
We also show that if some rows are non-null, with at least one false null hypothesis, then if the p-values for the global null hypotheses of the non-null rows are essentially zero, using the conditional approach with BH both at the global level and within each row also controls the average error at a level close to α.

Simulations
In order to assess the performance of different post-selection approaches, we carry out simulation studies. We have three specific aims in this simulation: (1) to examine the possible inflation in the number of false positives when failing to account for the selection (i.e., when using the original p-values in selected rows); (2) to compare the power of methods that properly account for selection, i.e., our novel approach that uses conditional p-values, and the approach of Benjamini and Bogomolov (2014); and (3) to study the effect of dependency within the columns.
We have two data-generation settings. First, the row independence setting, in which all test statistics are independent. We sample the unit-level test statistics Z_ij, i = 1, …, m, j = 1, …, n, independently from the normal distribution as follows. If the null hypothesis H_ij is true, the test statistic has a standard normal distribution; if the null hypothesis H_ij is false, the test statistic has a normal distribution with mean µ ∈ {0.5, …, 7} and variance one. The p-value is p_ij = 1 − Φ(z_ij), where Φ(·) is the standard normal cumulative distribution function. In each of the n columns, m = 1000 rows are examined, and m_1 = 10 rows contain signal in n_1 of the n columns. So we generated n_1 × m_1 entries in the m × n matrix of p-values that contained signal we aim to discover.
Second, the row-dependence setting, in which within each column of m hypotheses we generate m/B independent blocks with within-block dependency. The vector of B z-scores within a block is multivariate normal with covariance matrix Σ; if the block does not contain signal, the mean vector is zero, and if it contains signal, the mean is nonzero in the signal-bearing entries. We consider the following post-selection analyses using the BH procedure on each selected row i ∈ S: at level α on the original p-values (BH-naive); at level (|S|/m)α on the original p-values, as suggested by Benjamini and Bogomolov (2014) (BH-BB); and at level α on the conditional p-values in (3.4) (BH-cond). We also considered using the Bonferroni-Holm procedure; those results turned out to be qualitatively the same as the results using the BH procedure, see details in the SM.
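The independence setting and the three ways of testing within selected rows can be sketched compactly as follows. This is a simplified illustration of the simulation design rather than the full study: a single configuration, a single replicate, Fisher aggregation with Bonferroni selection, and the three BH variants described above.

```python
import numpy as np
from scipy import stats

def bh(pv, a):
    """BH step-up at level a; returns a boolean rejection vector."""
    order = np.argsort(pv)
    k = len(pv)
    ok = pv[order] <= a * np.arange(1, k + 1) / k
    rej = np.zeros(k, dtype=bool)
    if ok.any():
        rej[order[: np.max(np.nonzero(ok)[0]) + 1]] = True
    return rej

rng = np.random.default_rng(0)
m, n, m1, n1, mu, alpha = 1000, 10, 10, 4, 4.0, 0.05

# unit-level z-scores: mean mu in the first n1 columns of the first m1 rows
z = rng.standard_normal((m, n))
z[:m1, :n1] += mu
p = stats.norm.sf(z)                      # one-sided p-values

# stage 1: Fisher aggregate test with a Bonferroni cut-off
fisher = -2.0 * np.log(p).sum(axis=1)
t = stats.chi2.ppf(1 - alpha / m, df=2 * n)
selected = np.where(fisher >= t)[0]

# stage 2: BH within each selected row (naive, BB-adjusted, conditional)
n_naive = n_bb = n_cond = 0
for i in selected:
    prod_rest = np.exp(np.log(p[i]).sum() - np.log(p[i]))
    b = np.minimum(np.exp(-t / 2.0) / prod_rest, 1.0)
    n_naive += bh(p[i], alpha).sum()
    n_bb += bh(p[i], alpha * len(selected) / m).sum()
    n_cond += bh(p[i] / b, alpha).sum()
```

Since p̃_j ≥ p_j and BH is monotone in its inputs and in its level, BH-cond and BH-BB can never reject more than BH-naive on the same data.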
To evaluate the power of the different procedures, we computed the average power, i.e., the average number of true rejections divided by the number of entries with true signal, as well as the average power for a specific row, i.e., the average number of true rejections in the row divided by n 1 .

Results for the independence setting
The results on the average and conditional FDR control show that the control of false positives was tightest using the conditional approach, since it controlled all the error measures (as expected) at the nominal 0.05 level. The approach of Benjamini and Bogomolov (2014) controlled at the nominal 0.05 level the average FDR/FWER (as expected), as well as the conditional FDR/FWER for a correctly selected row (i.e., a row that contains signal), but did not control the conditional FDR/FWER for an incorrectly selected row (i.e., a row that contains no signal). The naive approach exceeded the nominal 0.05 level in all error measures considered (as expected, since this procedure does not account for selection). Figure 1 shows the actual level of each error measure for the three post-selection analysis procedures. We see that BH-naive can have a high inflation of false positives. The inflation for an incorrectly selected row was so high that it was not plotted in the third row of Figure 1.
The fact that the inflation can be substantial clearly demonstrates that accounting for selection is necessary even when the selection criterion is very stringent. The fourth row of Table S1 in the SM shows the exact levels of the error measures for BH-naive for the setting in the last row of Figure 1 when µ = 4: on a correctly selected row (i.e., a row that contains signal), the conditional FDR was 0.065; on an incorrectly selected row, the conditional FDR was 0.996; and the average FDR was 0.074.
The results on the power show that the conditional approach is best when (1) many correctly selected rows are expected to contain signal in more than two columns; and (2) the number of rows selected out of the large number of rows examined is expected to be small (Figure 1). Note that in this simulation the average power is the same as the average power within a non-null row, and they are equal to the probability of discovering a true signal (i.e., rejecting a single non-null hypothesis), since we generated the signal (i.e., non-null test statistics) from the same distribution, with the same number of signals (i.e., non-nulls) in each of the m_1 rows. Although the naive procedure should not be used due to its unacceptable inflation of false positives, we plot its power so it can serve as a benchmark for the power loss due to the necessary adjustment for selection. Examining the power of BH-BB and BH-cond, we see that the conditional approach is more powerful than the approach of Benjamini and Bogomolov (2014) when (n, n_1) ∈ {(21, 7), (15, 5), (10, 4)}. The gain in power can be very large. For example, when µ = 3 the power difference between our approach and that of Benjamini and Bogomolov (2014) was greater than 40% for (n, n_1) = (21, 7) and (n, n_1) = (15, 5), and about 30% for (n, n_1) = (10, 4). Our approach was slightly less powerful than the approach of Benjamini and Bogomolov (2014) for (n, n_1) = (10, 2). Additional results are provided in Supplemental Table S1.

Results for the dependence setting
The results on the average and conditional FDR control show that the control of false positives was tightest using the conditional approach. The conditional approach controlled all the error measures (as expected) at the nominal 0.05 level when the row selection was by Bonferroni (i.e., the threshold for selection was independent of the global null p-values). When the row selection was by BH at level q = 0.2 on the global null p-values, the average FDR and the conditional FDR for a non-null row were below the nominal 0.05 level. However, the conditional FDR for a null row was slightly above the nominal 0.05. When the row selection was by BH at level q = 0.05 instead of q = 0.2, no inflation was observed (results not shown).
The approach of Benjamini and Bogomolov (2014) controlled at the nominal 0.05 level the average FDR/FWER (as expected), as well as the conditional FDR/FWER for a correctly selected row (i.e., a row that contains signal), but did not control the conditional FDR/FWER for an incorrectly selected row (i.e., a row that contains no signal). The naive approach exceeded the nominal 0.05 level in all error measures considered. Figure 2 shows the actual level of each error measure for the three post-selection analysis procedures, for (n, n_1) ∈ {(21, 7), (15, 5), (10, 4), (10, 2)}; its panels include the conditional FDR for a null row, i.e., a row that was selected despite the fact that all columns are null (the naive procedure does not appear because its value was above 0.80), and the conditional FDR for a non-null row, i.e., a row that was correctly selected since it has signal in at least one column.
As in the independence setting, the results on the power show that the conditional approach is best when (n, n_1) = (21, 7), but the approach of Benjamini and Bogomolov (2014) has slightly better power when (n, n_1) = (10, 2) (Figure 2). Note that in this simulation the average power is different from the average power within a specific row, since the generated signal varied across rows.

Cross-tissue eQTL analysis in The Cancer Genome Atlas (TCGA) Project
Figure 2: Each study had 100 blocks of size 11 rows, and in n_1 of the studies the first block contained signal. The dependence parameter was ρ = 0.7. From top to bottom: (1) average power (in black) and average power within the sixth row, which has the strongest signal in the block (in blue); (2) conditional FDR for a null row, i.e., a row that was selected despite the fact that all columns are null (the naive procedure does not appear because its value was above 0.80); and (3) conditional FDR for the sixth row. Settings: (n, n_1) = (21, 7) with row selection rule p_iG ≤ 0.05/1100 (Bonferroni) and with row selection by BH at level q = 0.2 on {p_iG, i = 1, ..., 1100}; (n, n_1) = (10, 2) with row selection rule p_iG ≤ 0.05/1100; and (n, n_1) = (10, 2) with row selection by BH at level q = 0.2 on {p_iG, i = 1, ..., 1100}.

Expression quantitative trait loci (eQTLs) are genomic regions harboring genetic variants that influence the expression level of genes. Identifying eQTLs is important for understanding the biological mechanisms that control normal physiological processes, as well as their aberrations that can lead to complex diseases. Because gene regulation is tissue specific, eQTL analysis is most informative when performed on relevant tissue samples, but such analyses frequently lack statistical power because of small per-tissue sample sizes. It has been observed, however, that some eQTL SNPs are predictive of gene expression levels across multiple tissues, and identification of such eQTLs can be facilitated by aggregated analysis across tissue types. A number of studies (Rivas et al., 2015; Li et al., 2016, among others) have reported results based on such cross-tissue eQTL analysis using data from the Genotype-Tissue Expression (GTEx) project. In this section, we illustrate the post-selection procedure in an eQTL analysis using 17 tumor tissues in The Cancer Genome Atlas (TCGA) project (http://cancergenome.nih.gov/). We first performed an aggregated eQTL analysis across the 17 tissue types to identify eQTL SNPs influencing gene expression in at least one tissue type. For significant eQTLs, we then performed post-selection inference to identify the tissue types carrying the eQTL effect. We downloaded genotype and total gene expression data based on RNA sequencing from the TCGA website. The quality control (QC) for the genetic data and the normalization of the RNA-seq data are described in the Supplementary Materials. After QC, 4,476 subjects with European ancestry and 19,285 genes were included for analysis. The sample size for each tissue type is reported in Supplementary Table S2. For the purpose of illustration, we performed cis-eQTL analysis (i.e., analyzing SNPs less than 1,000,000 base pairs from the target gene) using matrixeQTL (Shabalin, 2012), adjusting for sex, age, and the top five principal component scores to eliminate potential confounding due to population stratification. In total, we analyzed m = 7,732,750 SNP-gene pairs to identify cis-eQTLs. Because samples are independent across tissue types in TCGA, the p-values are independent across tissues.
For selection of eQTL SNPs (i.e., the rows), we combined p-values across the different tissue types (columns) by a Fisher-style test suggested by Pearson, as described in Owen (2009): it applies Fisher's combining method to the left-sided alternatives and, separately, to the right-sided alternatives, and takes the maximum of the two resulting statistics. The global null p-value is therefore

p_iG = min{ 1, 2 Pr( χ²_{2n} ≥ max( −2 ∑_{j=1}^{n} log p^L_{ij}, −2 ∑_{j=1}^{n} log(1 − p^L_{ij}) ) ) },    (5.1)

where p^L_{ij} is the p-value when testing the left-sided alternative for feature i in study j. This test has a strong preference for common directionality (i.e., it will have greater power than a test based on Fisher's combining method on two-sided p-values when the direction of the signal is consistent across tissues), while not requiring us to know the common direction. We identified 19,690 significant SNP-gene pairs using the global null p-value threshold 0.05/7,732,750 = 6.47 × 10⁻⁹, based on the Bonferroni correction for FWER control at the 0.05 level. For each of the 19,690 SNP-gene pairs, we proceeded to post-selection inference to identify the relevant tissue types using our conditional approach. Using the BH procedure on the selected rows, the median number of tissue discoveries was 6, with an inter-quartile range (IQR) of [4, 8]. For comparison, we also applied the approach of Benjamini and Bogomolov (2014), which made far fewer discoveries: the median number of discoveries was 1, with IQR [0, 2]. The conditional approach results in many more discoveries than the approach of Benjamini and Bogomolov (2014) for two reasons. First, because the number of selected SNP-expression pairs is far smaller than the number originally examined (19,690/7,732,750 = 0.0025), and the approach of Benjamini and Bogomolov (2014) is more conservative the smaller this ratio is.
Second, because for many SNP-expression pairs there are at least two tissues that are highly significant, the conditional p-values coincide with the original p-values, and there was essentially no cost of selection within these pairs.
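Both ingredients of this explanation can be sketched numerically: the Pearson-type global null p-value used for row selection, and a brute-force conditional p-value that checks, over a grid, for which values of the jth p-value the row would still have been selected. The Python below is an illustrative sketch of ours; it does not reproduce the closed form (3.3), the chi-square tail uses the exact series available for even degrees of freedom, and the demo uses a loose selection threshold of 0.001 rather than the genome-wide 6.47 × 10⁻⁹ (which would require a much finer grid).

```python
import math

def chi2_sf_even_df(x, df):
    """P(X >= x) for X ~ chi-square with even df (here df = 2n):
    exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!."""
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i              # running (x/2)^i / i!
        total += term
    return math.exp(-half) * total

def pearson_global_p(p_left):
    """Global null p-value: Fisher's statistic on the left-sided p-values
    and on the right-sided ones; the larger statistic is referred to the
    chi-square(2n) tail, doubled to account for trying both directions."""
    n = len(p_left)
    stat_left = -2.0 * sum(math.log(p) for p in p_left)
    stat_right = -2.0 * sum(math.log1p(-p) for p in p_left)
    return min(1.0, 2.0 * chi2_sf_even_df(max(stat_left, stat_right), 2 * n))

def conditional_p(p_obs, still_selected, grid_size=100_000):
    """Approximate Pr(P <= p_obs | row selected) for a null P ~ U(0, 1),
    where still_selected(u) reports whether the row would be selected if
    the j-th p-value were u, all other p-values held fixed."""
    sel = num = 0
    for i in range(grid_size):
        u = (i + 0.5) / grid_size     # midpoint of the i-th grid cell
        if still_selected(u):
            sel += 1
            if u <= p_obs:
                num += 1
    return num / sel

# Two other tissues highly significant: the row is selected whatever the
# third p-value is, so there is no cost of selection.
print(conditional_p(0.01, lambda u: pearson_global_p([u, 1e-8, 1e-8]) <= 0.001))
# -> 0.01
# Two other tissues only mildly significant: selection depends on the third
# p-value, and the conditional p-value is much larger than 0.01.
print(conditional_p(0.01, lambda u: pearson_global_p([u, 0.01, 0.02]) <= 0.001))
```

The first call illustrates the claim above: when the row would be selected regardless of the jth p-value, the conditional p-value equals the original one.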
An R implementation of the conditional p-value computation after selection by thresholding the global null p-values computed using (5.1) or (3.1) is available upon request from the first author (and will soon become available as a Bioconductor package).

In practice, interest often centers on the most significant SNP within each gene, and reporting only this SNP constitutes a second selection step (see Supplemental Figure S2). Thus, we propose a simple modification of our post-selection inference method to account for the second selection. We denote the recommended procedure for FDR control by BH-cond-MT, where MT stands for the additional Multiple Tests (of the SNPs that passed the first, aggregate-level selection threshold within the gene) that we need to correct for after computing the conditional p-values.
Procedure 5.1. BH-cond-MT post-selection procedure for conditional FDR control at level α:

1. Select all SNP-expression pairs that reach the genome-wide significance threshold h(q, m) (e.g., h(q, m) = q/m if the Bonferroni correction is used for FWER control at level q on the global null hypotheses): S = {i : p_iG ≤ h(q, m)}.
2. For each gene k with at least one selected SNP, select the most significant SNP in gene k, i_k = arg min_{i ∈ S, i in gene k} p_iG. Let R_k = |{i : i ∈ S, i in gene k}| be the number of SNPs selected in gene k.
3. Compute the conditional p-values as in (3.3) for row i_k.
4. Apply the BH procedure at level α/R_k to the conditional p-values in row i_k.
For conditional FWER control, apply Bonferroni-Holm instead of BH in step 4 of Procedure 5.1.
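The per-gene steps of the procedure can be sketched as follows (illustrative Python of ours, not the R implementation mentioned above: bh_reject is a textbook Benjamini-Hochberg step-up rule, and the conditional p-values from step 3 are taken as given):

```python
def bh_reject(pvals, alpha):
    """Benjamini-Hochberg step-up rule: return the indices of the
    rejected hypotheses at level alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0                                   # number of rejections
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k = rank
    return sorted(order[:k])

def bh_cond_mt_gene(cond_pvals, r_k, alpha=0.05):
    """Step 4 of Procedure 5.1 for a single gene: apply BH at level
    alpha / R_k to the conditional p-values (one per tissue) of the
    gene's most significant SNP, where R_k is the number of SNPs
    selected in the gene at step 1."""
    return bh_reject(cond_pvals, alpha / r_k)

# The same conditional p-values, without and with the correction for the
# second selection (R_k = 1 vs. R_k = 4 SNPs selected in the gene):
print(bh_cond_mt_gene([0.003, 0.011, 0.2, 0.8], r_k=1))  # -> [0, 1]
print(bh_cond_mt_gene([0.003, 0.011, 0.2, 0.8], r_k=4))  # -> [0]
```

The example shows how a larger R_k shrinks the effective BH level from 0.05 to 0.05/4 = 0.0125 and can remove borderline tissue discoveries; for conditional FWER control, the BH step would be replaced by Holm at level α/R_k, as noted above.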
In Supplemental Figure S3 we show that the procedure controls the FDR for dependencies across SNPs similar to those that arise in GWAS datasets within each study due to LD, and in Appendix D we formally prove that the conditional FDR is controlled when the rows are independent.
The impact of the second selection is illustrated in Table 2, which reports the conditional p-values of four selected SNPs, as well as the BH-adjusted conditional p-values for the most significant SNP (column 6). In bold are the adjusted p-values ≤ 0.05/4, i.e., the significant discoveries using Procedure 5.1 with α = 0.05. Underlined are the adjusted p-values in (0.05/4, 0.05], i.e., discoveries using BH-cond that are not discoveries using Procedure 5.1 with α = 0.05. The conditional p-values were identical to the unconditional p-values for all four SNPs (since, for each tissue j, the SNP would have been selected regardless of the value of the jth p-value). The global null test statistic for each SNP is provided in the last row (the corresponding p-value was effectively zero). For example, among the 2,235 genes with at least one selected SNP, at least two tissue discoveries were made in 1,857 genes by BH-cond-MT but in only 678 genes by BH-BB. For the BH procedure based on conditional p-values, accounting for the second selection noticeably reduced the number of significant findings.
This was expected, since the number of SNPs selected per gene is typically greater than one: out of the 19,690 pairs that reached genome-wide significance there were 2,235 unique genes, and the number of SNPs per gene varied between 1 and 76, with a median of 5.

Discussion
Results from both the simulation studies and the data analysis highlight the potential of the proposed method for valid and powerful hypothesis testing for detection of signals at the level of the finer units, following selection of broader units using aggregate-level test statistics. Although the method is not as general as its existing competitor (Benjamini and Bogomolov, 2014), it can handle the common setting in which classes are selected by thresholding an aggregate-level test statistic.

Table 3: By each post-selection method, the number of genes with at least x discoveries of tissues.

Therefore,

∑_{k=1}^{n} (1/k) Pr(I = 1 and R = k | f(P_1, p_2, ..., p_n) ≥ t, P_2 = p_2, ..., P_n = p_n) = (1/J) Pr(P_1 ≤ αJ/n | f(P_1, p_2, ..., p_n) ≥ t, P_2 = p_2, ..., P_n = p_n) = α/n.

C Proof of Theorem 3.3

Proof. Since E(C_i | i ∈ S, P^(i)) ≤ α, the result is immediate.

D Proof of conditional FDR control for Procedure 5.1

Proof. Relabel the rows so that the first m_k rows are the SNPs for gene k. Let S_k be the selection status of the SNPs in gene k, i.e., S_k = {I[P_iG ≤ h(q, m)], i in gene k}. The procedure guarantees conditional error control for each SNP i in gene k at level α/R_k if the rows are independent; it therefore follows that the conditional error is also controlled for the row with the smallest global null p-value.