Application of post-selection inference to multi-omics data yields insights into the etiologies of human diseases

Ronald Yurko; Max G’Sell; Kathryn Roeder; Bernie Devlin

doi:10.1101/806471

Abstract

To correct for a large number of hypothesis tests, most researchers rely on simple multiple testing corrections. Yet, new methodologies of post-selection inference could potentially improve power while retaining statistical guarantees, especially those that enable exploration of test statistics using auxiliary information (covariates) to weight hypothesis tests for association. We explore one such method, adaptive p-value thresholding (AdaPT; Lei & Fithian 2018), in the framework of genome-wide association studies (GWAS) and gene expression/coexpression studies, with particular emphasis on schizophrenia (SCZ). Selected SCZ GWAS association p-values play the role of the primary data for AdaPT; SNPs are selected because they are gene expression quantitative trait loci (eQTLs). This natural pairing of SNPs and genes allows us to map the following covariate values to these pairs: independent GWAS statistics from genetically correlated bipolar disorder, the effect size of SNP genotypes on gene expression, and gene-gene coexpression, captured by subnetwork (module) membership. In all, 24 covariates per SNP/gene pair were included in the AdaPT analysis using flexible gradient boosted trees. We demonstrate a substantial increase in power to detect SCZ associations, which is especially apparent when using gene expression information from the developing human prefrontal cortex (Werling et al. 2019), as compared to adult tissue samples from the GTEx Consortium. We interpret these results in light of recent theories about the polygenic nature of SCZ. Importantly, our entire process for identifying enrichment and creating features with independent complementary data sources can be implemented in many different high-throughput settings to ultimately improve power.

Large scale experiments, such as scanning the human genome for variation affecting a phenotype, typically result in a plethora of hypothesis tests. To overcome the multiple testing challenge, one needs corrections to simultaneously limit false positives while maximizing power. Introduced by Benjamini & Hochberg (1995), false discovery rate (FDR) control has become a popular approach to improve power for detecting weak effects by limiting the expected false discovery proportion (FDP) instead of the more classical Family-Wise Error Rate. The Benjamini-Hochberg (BH) procedure was the first method to control FDR at target level α using a step-up procedure that is adaptive to the set of p-values for the hypotheses of interest (Benjamini & Hochberg 1995). Other methods for FDR control have led to improvements in power over BH by incorporating prior information, such as by the use of p-value weights (Genovese et al. 2006). In the “omics” world – genomics, epigenomics, proteomics, and so on – the challenge of multiple testing is burgeoning, in part because our ability to characterize omics features grows continually and in part because of the realization that multiple omics are required for describing phenotypic variation. One might imagine merging complementary omics data and tests using a priori hypothesis weights to improve power; however, until recently, it was not clear how to choose these weights in a data driven manner.
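As a point of reference for the covariate-free baseline, the BH step-up rule described above can be sketched in a few lines. This is a minimal illustration, not code from the study; the function name and the toy p-values are assumptions made for the example.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return the (sorted) indices of hypotheses rejected by the BH step-up rule."""
    n = len(pvalues)
    # Order hypothesis indices by ascending p-value.
    order = sorted(range(n), key=lambda i: pvalues[i])
    # Find the largest rank k (1-based) such that p_(k) <= k * alpha / n.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / n:
            k_max = rank
    # Reject the k_max hypotheses with the smallest p-values.
    return sorted(order[:k_max])


# Toy p-values (hypothetical, for illustration only).
example_pvalues = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
rejected = benjamini_hochberg(example_pvalues, alpha=0.05)
```

Note that every hypothesis is compared against the same sequence of thresholds; the methods discussed next differ precisely in letting auxiliary information change how stringently each hypothesis is treated.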

Recent methodologies have been proposed to account for covariates or auxiliary information while maintaining FDR control (Scott et al. 2015, Ignatiadis et al. 2016, Boca & Leek 2018, Li & Barber 2019, Zhang et al. 2019). We implement a selective inference approach, the adaptive p-value thresholding procedure (AdaPT; Lei & Fithian 2018), to fully explore prior auxiliary information while maintaining guaranteed finite-sample FDR control. In a recent review, Korthauer et al. (2019) compared the performance of AdaPT with other covariate-informed methods for FDR control in simple one- and two-dimensional covariate examples. One of the weaknesses they ascribe to AdaPT is the unintuitive modeling framework for incorporating covariates; however, we fully embrace AdaPT’s flexibility via gradient boosted trees in a much richer, high-dimensional setting. Our boosting implementation of AdaPT easily scales with more covariates, enabling practitioners to capture interactions and non-linear effects from the rich resources of prior information available. In this manuscript, we demonstrate our gradient boosted trees implementation of AdaPT on results from genome-wide association studies (GWAS), incorporating covariates constructed from independent GWAS and gene expression studies. Specifically, we apply AdaPT to GWAS for detecting single nucleotide polymorphisms (SNPs) associated with schizophrenia (SCZ) using bipolar disorder (BD) GWAS results from an independent sample as a covariate. Additionally, we incorporate results from the recent BrainVar study to identify a set of expression-SNPs (eSNPs) based on 176 neurotypical brains, sampled from pre- and post-natal tissue from the human dorsolateral prefrontal cortex (Werling et al. 2019).
Along with the genetically correlated BD z-statistics, we create additional features from this complementary data source by summarizing the associated developmental gene expression quantitative trait loci (eQTL) slopes and membership in gene coexpression networks. We demonstrate that this process of identifying an enriched set of eSNPs and applying AdaPT with covariates summarizing gene expression from the developing human prefrontal cortex yields substantial improvement over the same pipeline of analysis applied to adult tissue samples from the GTEx Consortium (2015). Furthermore, we see improvements in the discovery rate with each additional piece of information from the BrainVar study and validate the replication of our results using more recent, independent SCZ studies.

This study had two goals: to explore the use of AdaPT in a realistic high-dimensional multiomics setting and to determine what can be learned about the neurobiology of SCZ by this exploration. Our results revealed the power of incorporating auxiliary information with flexible gradient boosted trees. While each covariate independently provided at best a modest increase in power, our adaptive search discovered a more complex model with far greater power. These discoveries also led to greater support for the polygenic basis of SCZ, complementing recent findings, suggesting that there are many physiological avenues to its underlying neurobiology. We emphasize that the process and analysis undertaken with this implementation of AdaPT can be extended to a variety of “omics” and other settings to utilize the rich contextual information that is often ignored by standard multiple testing corrections. We highlight this feature by analyzing two other sets of GWAS studies, type 2 diabetes (T2D) and body mass index (BMI), using results from these analyses to interpret findings from SCZ.

Results

Methodology overview

AdaPT is an iterative search procedure, introduced by Lei & Fithian (2018), for determining a set of discoveries/rejections, ℛ, with guaranteed finite-sample FDR control at target level α under conditions outlined below. We apply AdaPT to the prepared collection of p-values and auxiliary information, {(p_i, x_i), i = 1, …, n}, testing hypothesis H_i regarding SNP i's association with the phenotype of interest (e.g. SCZ). The covariates from some feature space, x_i ∈ 𝒳, capture information collected independently of p_i, but potentially related to whether or not the null hypothesis for H_i is true and the effect size under the alternative. AdaPT provides a flexible framework to incrementally learn these relationships, potentially increasing the power of the testing procedure, while maintaining valid FDR control.

For each step t = 0, 1, … in the AdaPT search, we first determine the rejection set ℛ_t = {i: p_i ≤ s_t(x_i)}, where s_t(x_i) is the rejection threshold at step t that is adaptive to the covariates x_i. This provides us with both the number of discoveries/rejections R_t = |ℛ_t|, as well as a pseudo-estimate for the number of false discoveries A_t = |{i: p_i ≥ 1 - s_t(x_i)}| (i.e. number of p-values above the “mirror estimator” of s_t(x_i)). These quantities are used to estimate the FDP at the current step t,

$$\widehat{\mathrm{FDP}}_t = \frac{1 + A_t}{\max\{R_t, 1\}}. \tag{1}$$

If this estimate is at most the target level α, then the AdaPT search ends and the set of discoveries ℛ_t is returned. Otherwise, we proceed to update the rejection threshold while satisfying two protocols:

1. updated threshold must be more stringent, s_t+1(x_i) ≤ s_t(x_i), ∀x_i ∈𝒳,

2. small and large p-values determining R_t and A_t are partially masked: for any test with p_i ≤ s_t(x_i) or p_i ≥ 1 - s_t(x_i), only the unordered pair {p_i, 1 - p_i} is revealed, while all other p-values are fully observed.

Under these protocols, the rejection threshold can be updated using R_t, A_t, the covariates x_i, and the partially masked p-values. The flexibility in how this update takes place is one of AdaPT’s key strengths and allows it to easily incorporate other approaches from the multiple testing literature, such as a conditional version of the classical two-groups model (Efron et al. 2001, Scott et al. 2015) with estimates for the probability of being non-null, π₁, and the effect size under the alternative, µ.

The algorithm proceeds by sequentially updating the threshold s_t+1(x_i) to discard the most likely null element in the current rejection region, as measured by the conditional local false discovery rate (fdr): i.e., the hypothesis in ℛ_t with the largest estimated conditional local fdr is removed from ℛ_t. With the threshold updated, the AdaPT search repeats by estimating FDP and updating the rejection threshold until the estimated FDP is at or below the target FDR level α.
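The search loop can be illustrated with a deliberately simplified sketch. It assumes the FDP estimate (1 + A_t) / max(R_t, 1) of Lei & Fithian (2018) and replaces the covariate-adaptive threshold s_t(x_i) with a single constant threshold, so that "discarding the most likely null element" reduces to dropping the largest masked value min(p_i, 1 - p_i). The function name, the starting threshold of 0.45, and the toy data are assumptions for illustration; this is not the gradient boosted implementation used in the study.

```python
def adapt_constant_threshold(pvalues, alpha=0.05, s0=0.45):
    """Simplified AdaPT-style search with one constant threshold (no covariates)."""
    s = s0
    while True:
        # Rejection set and its size R_t.
        rejected = [i for i, p in enumerate(pvalues) if p <= s]
        R = len(rejected)
        # Mirror count A_t; "1 - p <= s" is equivalent to "p >= 1 - s".
        A = sum(1 for p in pvalues if 1 - p <= s)
        fdp_hat = (1 + A) / max(R, 1)
        if fdp_hat <= alpha:
            return rejected
        if R == 0:
            return []  # nothing left to reject at this target level
        # Only masked values min(p, 1 - p) inside the current regions guide the
        # update, mimicking the partial masking protocol.
        masked = [min(p, 1 - p) for p in pvalues if min(p, 1 - p) <= s]
        largest = max(masked)
        smaller = [m for m in masked if m < largest]
        # The new threshold is strictly smaller, so it is never less stringent.
        s = max(smaller) if smaller else -1.0


# Toy data (hypothetical): 30 strong signals followed by 20 evenly spread nulls.
signals = [0.0001 * (k + 1) for k in range(30)]
nulls = [(j + 0.7) / 20 for j in range(20)]
discoveries = adapt_constant_threshold(signals + nulls, alpha=0.05)
```

In this toy run the threshold shrinks until no p-value remains in the mirror region, at which point the estimate falls below α and only the small p-values are returned. In the full procedure the ordering of candidates for removal comes from the fitted conditional local fdr rather than from the masked values alone.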

This procedure guarantees finite-sample FDR control under independence of the null p-values and as long as the null distribution of p-values is mirror conservative, i.e. the large “mirror” counterparts 1 - p_i ≥ 0.5 are at least as likely as the small p-values p_i ≤ 0.5. To address the assumption of independence, we select a subset of weakly correlated SNPs detailed in Data, and additionally provide simulations in SI Appendix (Figures S18-S20) showing that AdaPT appears to maintain FDR control in positive dependence settings. However, one practical limitation we encounter with the FDP estimate in Equation 1 is observing p-values exactly equal to one. While this can understandably occur with publicly available GWAS summary statistics, p-values equal to one will always contribute to the estimated number of false discoveries A_t. This nuance can lead to a failure of obtaining discoveries at a desired target α, such as the reported AdaPT results by Korthauer et al. (2019) for multiple case-studies. However, we demonstrate in SI Appendix an adjustment to the p-values for T2D and BMI GWAS applications which alleviates this problem (Figures S10 and S13), but future work can explore modifications to the FDP estimator itself.

For the modeling step of AdaPT, which estimates the conditional local fdr, we use gradient boosted trees, which construct a flexible predictive function as a weighted sum of many simple trees, fit using a gradient descent procedure that minimizes a specified objective function. In our case, the two objective functions considered correspond to estimating the probability of a test being non-null and the distribution of the effect size for non-null tests. The advantage of this approach to function fitting is that it is invariant to monotonic variable transformations, automatically incorporates important variable interactions, and is able to handle a large number of potentially useful covariates without degrading significantly in performance due to the high dimensionality. In contrast, less effective methods might fail to capture useful information because the covariates are incorrectly scaled for a linear function, because the important information is only revealed through a combination of covariates, or because the important signal is simply swamped by the number of possible predictors to search through. Our choice of method gives the flexibility to include many potentially useful covariates without being overly concerned about the functional form with which they enter or their marginal utility. In our implementation, we employ the XGBoost library (Chen & Guestrin 2016) to capitalize on its computational advantages.

Figure 1 displays the full pipeline of our implementation of AdaPT to GWAS summary statistics for SNPs using expression quantitative trait loci (eQTL) to choose the SNPs under investigation. Methods detail the EM algorithm used to model the conditional local false discovery rate with gradient boosted trees.

Figure 1:

Summary of AdaPT implementation on GWAS results for selected set of SNPs.

Data

Our investigation includes AdaPT analyses of published GWAS p-values, {p_i, i = 1, …, n}, for body mass index (Locke et al. 2015, BMI), type 2 diabetes (Mahajan et al. 2018, T2D), and schizophrenia (Ruderfer et al. 2014, SCZ), but we focus our presentation on SCZ results. SCZ is a highly heritable, severe neuropsychiatric disorder. It is most strongly correlated, genetically, with another severe disorder, bipolar disorder (BD) (Lichtenstein et al. 2009, Cross-Disorder Group of the Psychiatric Genomics Consortium 2013). Because of this genetic correlation, reported z-statistics from BD GWAS can be used as informative covariates for determining the SCZ rejection threshold. We use the GWAS summary statistics reported by Ruderfer et al. (2014) available from the Psychiatric Genomics Consortium (PGC), with independent controls for BD and SCZ, as an application of our AdaPT implementation (combined 19,779 SCZ and BD cases with 19,423 controls). Results from more recent studies in Ruderfer et al. (2018) are used for replication analysis of our results (combined 53,555 SCZ and BD cases with 54,065 controls). However, the 2014-only studies from Ruderfer et al. (2014) are a subset of the all-2018 studies from Ruderfer et al. (2018). Although we do not have access to the raw genotype data, we use the fact that both papers report inverse variance-weighted fixed effects meta-analysis results (Willer et al. 2010). We then separate the summary statistics for the 2018-only studies exclusive to Ruderfer et al. (2018), which are thus independent of the 2014-only studies and an appropriate hold-out to use for replication analysis.
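The separation step follows from the algebra of inverse variance-weighted fixed effects meta-analysis: the combined weight is the sum of the component weights, so the weight and effect of the newer studies can be recovered by subtraction. The sketch below illustrates this under the assumption that per-SNP effect estimates and standard errors are available for both the earlier and the combined analyses; the function names and numbers are hypothetical and the exact procedure used in the study may differ in detail.

```python
import math


def combine_ivw(beta_a, se_a, beta_b, se_b):
    """Inverse variance-weighted fixed effects combination of two estimates."""
    w_a, w_b = 1.0 / se_a ** 2, 1.0 / se_b ** 2
    beta = (w_a * beta_a + w_b * beta_b) / (w_a + w_b)
    return beta, math.sqrt(1.0 / (w_a + w_b))


def separate_new_studies(beta_all, se_all, beta_old, se_old):
    """Recover the estimate for studies present in 'all' but not in 'old'."""
    w_all, w_old = 1.0 / se_all ** 2, 1.0 / se_old ** 2
    w_new = w_all - w_old
    if w_new <= 0:
        raise ValueError("combined analysis must be more precise than the subset")
    beta_new = (w_all * beta_all - w_old * beta_old) / w_new
    return beta_new, math.sqrt(1.0 / w_new)


def two_sided_pvalue(beta, se):
    """Two-sided normal p-value, 2 * (1 - Phi(|z|)), written with erfc."""
    return math.erfc(abs(beta / se) / math.sqrt(2.0))


# Round trip on hypothetical numbers: combine, then separate again.
beta_all, se_all = combine_ivw(0.10, 0.05, 0.04, 0.04)
beta_new, se_new = separate_new_studies(beta_all, se_all, 0.10, 0.05)
p_new = two_sided_pvalue(beta_new, se_new)
```

The recovered statistics depend only on the newer studies, which is what makes them usable as a hold-out.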

After matching alleles from both 2014-only and all 2018 studies and limiting SNPs to those with imputation score INFO > 0.6 for both BD and SCZ in 2014-only (Ruderfer et al. 2014), we obtained 1,109,226 SNPs. Rather than test all SNPs, we chose to investigate a selected subset of SNPs, eSNPs, whose genotypes are correlated with gene expression; this additional filtering step captures a set of SNPs that are more likely to be functional and not highly correlated mutually (Nicolae et al. 2010). These eSNPs were identified from two sources. First, we evaluated the Genotype-Tissue Expression (GTEx) V7 project dataset (GTEx Consortium 2015) with adult samples from fifty-three tissues. As the first winnowing step, we identified the set of GTEx eQTLs for any of the available tissues at target FDR level α = 0.05. Rather than use all GTEx eQTLs, however, we winnowed the eQTLs by selecting SNPs whose genotypes are most predictive of expression for each gene. These SNP-gene pairs yielded n_GTEx = 31,558 eSNPs.

The second source was the BrainVar study of dorsolateral prefrontal cortex samples across a developmental span (Werling et al. 2019). BrainVar included cortical tissue from 176 individuals falling into two developmental periods: pre-natal, 112 individuals; and post-natal, 60 individuals. We identified n_BrainVar = 25,076 eSNPs as any eQTL SNP-gene pairs provided by Werling et al. (2019) meeting Benjamini-Hochberg α ≤ 0.05 for at least one of the three sample sets (pre-natal, post-natal, and complete = all), resulting in a set of eSNPs of comparable size to the GTEx eSNPs. (Because of the source of the BrainVar eSNPs, we did not analyze these for either BMI or T2D.)

There are only 3,382 SNPs in the intersection set of the two considered definitions for SCZ eSNPs, approximately 10% and 13% of GTEx and BrainVar eSNPs respectively. This relatively minor overlap is likely driven by the temporal difference in when samples are taken for GTEx (adults) as compared to BrainVar (developmental). Figure 2 displays a comparison of SCZ enrichment for the full set of SNPs to both the n_GTEx and n_BrainVar eSNPs. The BrainVar eSNPs appear to display the highest level of SCZ enrichment and are the primary focus of this manuscript.

Figure 2:

A comparison of qq-plots revealing SCZ enrichment for both BrainVar and GTEx eSNPs compared to the full set of SNPs from 2014 studies.

For each eSNP i, we created a vector of covariates x_i to incorporate auxiliary information collected independently of p_i, including p-values from GWAS studies of related phenotypes, and relationships inferred from gene expression studies. First, we utilize the mapping of eSNPs to genes derived from eQTLs assessed in a relevant tissue type r. We take the set of cis-eQTL genes associated with eSNP i in tissue type r and summarize the level of expression as the average absolute eQTL slope over this set. Additionally, we account for gene co-expression networks as covariates using the J modules generated with weighted gene co-expression network analysis (Zhang & Horvath 2005, WGCNA). For each of the j = 1, …, J WGCNA modules we create an indicator variable denoting whether or not eSNP i has any associated cis-eQTL genes in module j.

For the n_BrainVar eSNPs, we calculate the average absolute eQTL slope for each sample set, type ∈ {pre, post, complete}, to capture the eSNP’s overall expression association across three different points in the developmental span. Additionally, we use the J = 20 WGCNA modules (including unassigned gray) reported by Werling et al. (2019) to create indicator variables for j = 1, …, 20. This culminates in a vector of twenty-four covariates: the BD z-statistic, the three average absolute eQTL slopes, and the twenty WGCNA module indicators.
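The construction of the covariate vector can be sketched as follows. The data layout, the module labels, and the choice to use 0.0 when an eSNP has no eQTL in a given sample set are assumptions made for illustration; they are not taken from the study.

```python
SAMPLE_SETS = ["pre", "post", "complete"]
# Hypothetical labels standing in for the twenty WGCNA modules.
MODULES = ["M%d" % j for j in range(1, 21)]


def build_covariates(bd_z, eqtl_slopes, gene_modules):
    """Assemble one eSNP's covariates: BD z-statistic, slopes, module indicators.

    bd_z         -- BD GWAS z-statistic for the eSNP
    eqtl_slopes  -- dict mapping sample set -> list of eQTL slopes for the
                    eSNP's cis-eQTL genes in that sample set
    gene_modules -- module labels of the eSNP's cis-eQTL genes
    """
    covariates = [bd_z]
    for sample_set in SAMPLE_SETS:
        slopes = eqtl_slopes.get(sample_set, [])
        # Average absolute slope; 0.0 when no eQTL exists (an assumption here).
        avg = sum(abs(b) for b in slopes) / len(slopes) if slopes else 0.0
        covariates.append(avg)
    member = set(gene_modules)
    # One indicator per module: 1 if any associated gene belongs to it.
    covariates.extend(1 if m in member else 0 for m in MODULES)
    return covariates


# Hypothetical eSNP with two associated genes falling in modules M3 and M20.
x_i = build_covariates(
    bd_z=2.1,
    eqtl_slopes={"pre": [0.3, -0.5], "post": [0.2], "complete": [-0.4, 0.4]},
    gene_modules=["M3", "M20"],
)
```

The resulting list has 1 + 3 + 20 = 24 entries, matching the count given above.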

For a parallel analysis we create a vector of covariate information for the n_GTEx eSNPs, calculating the average absolute eQTL slope for each r ∈ {GTEx cortical tissues} as well as indicator variables for WGCNA modules generated using GTEx cortical tissue samples. A full description of the variables considered for the set of GTEx eSNPs, along with T2D and BMI, is in SI Appendix.

AdaPT discoveries

We proceed to apply the AdaPT search algorithm to the SCZ p-values from the 2014-only studies for both types of eSNPs with their respective vectors of covariates, capturing the correlation between BD and SCZ along with gene expression association and network summaries. At target FDR level α = 0.05, AdaPT returns R_GTEx = 23 and R_BrainVar = 843 discoveries for the n_GTEx = 31,558 and n_BrainVar = 25,076 eSNPs, respectively. As a baseline, we compare these results to an intercept-only version of AdaPT, which ignores covariates and was found to display a favorable performance in Korthauer et al. (2019). For the GTEx intercept-only results, 91 discoveries were returned versus 361 BrainVar intercept-only discoveries at α = 0.05. The Manhattan plots in Figures 3(A) and (B) compare the discovered BrainVar eSNPs from the intercept-only results to the fully informed AdaPT SCZ results with all twenty-four variables. The stark contrast between the expression data sources further reinforces the association between the SCZ GWAS results and gene expression in developmental periods measured in BrainVar, as compared to adult samples from GTEx. We examine the BrainVar results more closely for the remainder of the manuscript (see Figures S9 and S12 for T2D and BMI eSNP enrichment).

Figure 3:

Manhattan plots of SCZ AdaPT discoveries using (A) intercept-only model compared to (B) covariate informed model at target α = 0.05. (C) Comparison of the number of discoveries at target α = 0.05 for AdaPT with varying levels of covariates and (D) their resulting discovery intersections.

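Although we focus on the discoveries using the twenty-four covariates described earlier, for reference we additionally view the improvement in AdaPT’s performance on the BrainVar eSNPs by incrementally including more eSNP-level covariates. We start with the BD z-statistics:

BD z-stats: the BD z-statistic alone (one covariate),
BD z-stats + eQTL slopes: the BD z-statistic and the three average absolute eQTL slopes (four covariates),
BD z-stats + eQTL slopes + WGCNA: the BD z-statistic, the three slopes, and the twenty WGCNA module indicators (twenty-four covariates).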

For each set of covariates we tune the gradient boosted models (see SI Appendix and Table S1 for details on boosting parameters). Figure 3(C-D) displays the comparison in the number of discoveries between the different sets of covariates at target FDR level α = 0.05. The result yielding the highest number of discoveries is with all twenty-four covariates. For comparison, the results from using only the WGCNA module indicators are also displayed to show that the substantial improvement in performance results from using all three types of information together rather than from the impact of the WGCNA module indicators independently.

Variable importance and relationships

We examine the variable importance and partial dependence plots from the final gradient boosted models returned by AdaPT to provide us with insight into the relationships between each of the covariates considered and SCZ associations. Figures 4(A-B) display the variable importance plots for both the probability of being non-null (π₁) and effect size under alternative (µ) models respectively, displaying the relative contribution of each variable based on the total gain from its splits in the gradient boosted trees. We see similarities between both summaries, with the BD z-statistics appearing to be the most important.

Figure 4:

Variable importance plots for final AdaPT (A) probability of non-null and (B) effect size under alternative models. (C) Partial dependence plot for probability of being non-null and BD z-statistics. (D) SCZ enrichment of BrainVar eSNPs based on salmon WGCNA module membership.

Figure 4(C) displays the partial-dependence plot (Friedman 2001) for the estimated marginal relationship between the BD z-statistics and the probability of being non-null. This reveals an increasing likelihood for non-null results as the BD z-statistics grow in magnitude from zero, with noticeably sharp increases indicated by the dashed red lines around the nominal thresholds corresponding to BD p-values of 0.05. Figure 4(D) displays the clear enrichment for eSNPs with cis-eQTL genes that are members of the salmon WGCNA module reported by Werling et al. (2019), which also displayed relatively high variable importance. The unassigned gray module also displayed higher variable importance; however, this variable is predictive of SNPs that are classified as null, rather than associated with the phenotype. See SI Appendix for more partial dependence and WGCNA module enrichment plots (Figures S1-S3 for additional SCZ BrainVar results, and Figure S4 for SCZ GTEx).

Replication in independent studies

Next, we examine the replicability of the 2014-only SCZ BrainVar AdaPT results, using all twenty-four covariates, by checking the nominal discovery replication of the SCZ p-values from the independent 2018-only studies. For simplicity, we consider an AdaPT discovery at target FDR level α = 0.05 to be a nominal replication if its corresponding p-value for the 2018-only studies is less than 0.05. Of the 843 discoveries from the 2014-only studies, approximately 55.2% (465 eSNPs) were nominal replications in the 2018-only studies.

This nominal replication rate is unsurprising given the “winner’s curse” phenomenon, as described in Lohmueller et al. (2003), and does not imply that the rejected hypotheses are actually null. To empirically evaluate this result, we use the final non-null effect size model estimated by the AdaPT search to generate simulated p-values p^sim using the observed 2018-only studies’ standard errors. We generate the simulated p-values one thousand times for the 843 2014-only discoveries, and calculate the nominal replication rate with p^sim each time. The nominal replication rate across the one thousand simulations ranges from 51% to 64%, with an average of ∼ 57.2%, which provides reassurance regarding the observed rate. More details regarding the simulation process are provided in SI Appendix. Figure 5 additionally displays the relationship between the 2018-only p-values and the resulting 2014-only q-values (Storey 2002) from the AdaPT search on the -log10 scale (see Lei & Fithian (2018) for derivation of AdaPT q-values). The black line represents the increasing smoothing spline relationship between the two, with noticeably increasing evidence indicated by the 2018-only p-values for the set of AdaPT discoveries at α = 0.05.
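The idea behind such a simulation can be conveyed with a toy version. The sketch below assumes that each discovery has a known effect and that a replication estimate is drawn from a normal distribution centered on that effect with the replication study's standard error; the actual simulation derives effects from the fitted AdaPT effect size model (see SI Appendix), so the function names and inputs here are purely illustrative.

```python
import math
import random


def nominal_replication_rate(effects, std_errors, rng, threshold=0.05):
    """Fraction of simulated replication p-values falling below the threshold."""
    hits = 0
    for beta, se in zip(effects, std_errors):
        # Draw a replication estimate around the assumed true effect.
        estimate = rng.gauss(beta, se)
        # Two-sided normal p-value for the simulated estimate.
        p_sim = math.erfc(abs(estimate / se) / math.sqrt(2.0))
        if p_sim < threshold:
            hits += 1
    return hits / len(effects)


def simulate_rates(effects, std_errors, n_sim=1000, seed=1):
    """Repeat the simulation n_sim times and return every replication rate."""
    rng = random.Random(seed)
    return [nominal_replication_rate(effects, std_errors, rng)
            for _ in range(n_sim)]


# Hypothetical modest effects: roughly two standard errors from zero.
toy_rates = simulate_rates([0.08] * 200, [0.04] * 200, n_sim=100)
toy_range = (min(toy_rates), max(toy_rates))
```

Comparing the observed replication rate with the range of simulated rates indicates whether the observed value is plausible given the estimated effect sizes and the precision of the replication data.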

Figure 5:

Black line displays smooth relationship between SCZ p-values from 2018-only studies and the AdaPT q-values from the 2014-only studies. Blue-shaded region indicates AdaPT discoveries at α = 0.05 that are nominal replications (p-values from 2018-only studies < 0.05), while the red region denotes discoveries that failed to replicate.

Gene ontology comparison

With SNP discoveries spanning the genome, including nearly every chromosome, we sought biological insights. We applied gene ontology enrichment analysis (Ashburner et al. 2000, The Gene Ontology Consortium 2018) to the 136 genes obtained from the eQTL variant-gene pairs associated with the 843 discoveries. This analysis produced no clear signal, yielding only a minor enrichment for biological processes related to peptide antigen assembly. Several explanations are plausible; we explore two: either AdaPT is discovering SNPs of such small effect that the discoveries are not meaningful, or SCZ is a highly complex disorder with a large number of biological processes involved. For comparison we applied our full pipeline to GWAS summary statistics for T2D (Mahajan et al. 2018). This comparison is of interest because T2D is a disease with a well understood functional basis, and this is a well powered study with a sample size of 898,130 individuals (74,124 T2D cases and 824,006 controls). We restricted our analysis to 176,246 eSNPs based on eQTLs obtained using GTEx data. Next, we created eQTL-based covariates using pancreas, liver, and adipose tissue samples (see SI Appendix and Figures S9-S11 for more details on the implementation). After creating a vector of covariates from GTEx, AdaPT returned 14,920 eSNPs at α = 0.05, resulting in 5,970 associated genes. Applying gene ontology enrichment analysis to this gene list, we discovered enrichment for biological processes related to lipid metabolic process (see Figure 6), consistent with previous literature (Cirillo et al. 2018). These results provide some reassurance that the lack of specificity in the SCZ results can be attributed to the complex etiology of SCZ. For comparison, in the well powered BMI GWAS (339,224 subjects) we likewise found a lack of gene ontology enrichment among our gene discoveries (detailed in SI Appendix).

Figure 6:

T2D gene ontology enrichment analysis results for top ten biological processes based on positive fold enrichment.

Pipeline results for all 2018 studies

In addition to applying the pipeline to SCZ p-values from the 2014-only studies in Ruderfer et al. (2014), we also modeled p-values from all 2018 studies. The latter yields far more discoveries due to smaller standard errors from increased study sizes, even though the covariates were the same: with the same twenty-four covariates, we find 2,228 discoveries at target FDR level α = 0.05 when the pipeline is applied to the p-values for the most up-to-date set of studies versus 843 for the 2014-only studies. Notably, the intercept-only version of AdaPT returned 1,865 discoveries at α = 0.05, meaning the covariates contributed a ∼ 19% increase in discovery rate for all 2018 studies versus the ∼ 134% increase (361 to 843 eSNPs) from using the covariates for the 2014-only studies (see Figures S5 and S6). This reinforces the value of using auxiliary information in studies with lower power. Complementary to this observation, AdaPT applied to BMI GWAS yielded more discoveries for the intercept-only version than covariate informed models (details presented in SI Appendix, see Figures S14-S15). Simply accounting for more auxiliary information does not guarantee an improvement in power, and the advantages thereof diminish as power increases, as witnessed by results for all 2018 studies for SCZ and the large-scale BMI GWAS. Additionally, the larger number of discoveries for the SCZ all 2018 studies, 2,228, maps onto 382 genes. Despite this increase, these genes did not reveal any clear signal from the Gene Ontology enrichment analysis, comporting with results from the 2014-only studies.
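The two relative gains quoted above follow directly from the discovery counts. A short check, with a hypothetical helper name, makes the arithmetic explicit:

```python
def percent_increase(baseline, informed):
    """Relative gain in discoveries of the informed model over the baseline."""
    return 100.0 * (informed - baseline) / baseline


# Intercept-only versus covariate-informed discovery counts from the text.
gain_2014_only = percent_increase(361, 843)    # 2014-only studies
gain_all_2018 = percent_increase(1865, 2228)   # all 2018 studies
```

The absolute gains are of the same order (482 versus 363 additional eSNPs), so the difference in percentages is driven largely by the much larger intercept-only baseline in the better powered analysis.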

Discussion

Our goals in this study were to explore the use of AdaPT for high-dimensional multi-omics settings and investigate the neurobiology of SCZ in the process. AdaPT was used to analyze a selected set of GWAS summary statistics for SNPs, together with numerous covariates. Specifically, SNPs were selected if they were documented to affect gene expression; these SNP-gene pairs were dubbed eSNPs. Covariates for these eSNPs included independent GWAS test statistics from a genetically correlated phenotype, BD, which were mapped to eSNPs through SNP identity; as well as features of gene expression and co-expression networks, which were mapped to eSNPs through genes. By coupling flexible gradient boosted trees with the AdaPT procedure, relationships among eSNP GWAS test statistics and covariates were uncovered and more SNPs were found to be associated with SCZ, while maintaining guaranteed finite-sample FDR control. The tree-based handling of covariates addresses a perceived weakness of AdaPT, namely the unintuitive modeling framework for incorporating covariates (Korthauer et al. 2019). The pipeline we built should be simple to mimic for a wide variety of omics and other analyses.

Regarding the neurobiology of SCZ, two important findings emerge. First, by comparing results from SCZ GWAS when the expression/coexpression covariates were derived from developing human prefrontal cortex versus those from adult tissue samples from GTEx, the former yielded notably better results. This comports with the development of SCZ itself: first break episodes of psychosis typically occur by early adulthood, somewhat later for women than for men. Furthermore, the results underscore the fact that we can learn more from genetic associations about the neurobiology of SCZ by studying a large-scale developmental series of brains than we can by studying adult tissue; regrettably, however, most brain tissue is obtained from adults, often relatively old adults.

A second point of interest regards the level of complexity underlying the neurobiology of SCZ. If the origins of SCZ arose by perturbations of one or a few pathways, we would expect to converge on those pathways as we accrue more and more genetic associations. On the other hand, if the ways to generate vulnerability to SCZ were myriad, even if there is a single ultimate cause shared across all cases, then we might expect no such convergence, at least with regard to the common variation assessed through GWAS. Gene ontology analysis of associated discovery genes from either the 2014-only or all 2018 studies reveals no enrichment for biological processes for SCZ. There are many possible explanations for these null findings, one of which is simply a lack of power or specificity of our results. However, the result stands in stark contrast to the results for T2D, for which the gene ontology analysis converges nicely on accepted pathways to T2D risk; yet they comport with those for BMI, which is known to have myriad genetic and environmental origins. Therefore our results are consistent with myriad pathways to vulnerability for SCZ, although it is impossible to rule out other explanations: for example, the possibility that we understand so little about brain functions that gene ontology analyses lack specificity. In any case, our results are consistent with two recent theories underlying the genetics of SCZ, namely extreme polygenicity (O’Connor et al. 2019) and “omnigenic” origins (Boyle et al. 2017).

Although the examples considered in this manuscript pertain to omics data, this process can be adapted for a large variety of settings. We demonstrate in SI Appendix (Figures S18-S20) simulations showing that AdaPT appears to maintain FDR control in positive dependence settings emulating linkage disequilibrium (LD) block structure underlying GWAS results. There is a clear need, however, for future work to explore AdaPT’s properties and computational challenges under various dependence regimes. More insight can help determine its appropriateness for improving power versus other approaches, such as those with known structural constraints (Lei et al. 2017). The growing abundance of contextual information available in “omics” settings provides ample opportunity to improve power for detecting associations, using a flexible approach such as AdaPT, when addressing the multiple testing challenge.

Methods

Two-groups model

The most critical step in the AdaPT algorithm involves updating the rejection threshold s_t(x_i). Lei & Fithian (2018) use a conditional version of the classical two-groups model (Efron et al. 2001), yielding the conditional mixture density

$$f(p \mid x) = \pi_1(x)\, f_1(p \mid x) + (1 - \pi_1(x))\, f_0(p \mid x),$$

where the null p-values are modeled as uniform (f₀(p|x) ≡ 1). They proceed to use a conservative estimate for the conditional local false discovery rate, fdr(p|x) = f(1|x) / f(p|x), by setting 1 - π₁(x) = f(1|x).

We model the non-null p-value density with a beta distribution density parametrized by µ_i, where µ_i = 𝔼[-log(p_i)], resulting in a conditional density for a beta mixture model,

$$f(p \mid x_i) = \pi_1(x_i)\, \frac{1}{\mu_i}\, p^{1/\mu_i - 1} + 1 - \pi_1(x_i).$$

In this form, we can model the non-null probability π₁(x_i) = 𝔼[H_i|x_i] and the effect size for non-null hypotheses µ(x_i) = 𝔼[-log(p_i)|x_i, H_i = 1] with two separate gradient boosted tree-based models. The XGBoost library (Chen & Guestrin 2016) provides logistic and Gamma regression implementations which we use for π₁(x_i) and µ(x_i) respectively.
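The beta mixture and the conservative local fdr estimate described above can be illustrated numerically. The following is a minimal sketch, assuming the non-null density takes the Beta(1/µ, 1) form h(p; µ) = (1/µ) p^(1/µ − 1), for which 𝔼[-log(p)] = µ; the function names are hypothetical and are not part of the adaptMT package.

```python
def non_null_density(p, mu):
    """Beta(1/mu, 1) density h(p; mu) = (1/mu) * p**(1/mu - 1), for p in (0, 1]."""
    return (1.0 / mu) * p ** (1.0 / mu - 1.0)


def mixture_density(p, pi1, mu):
    """Conditional two-groups density with a uniform null component."""
    return pi1 * non_null_density(p, mu) + (1.0 - pi1)


def conservative_local_fdr(p, pi1, mu):
    """Conservative local fdr: f(1|x) / f(p|x), capped at one.

    f(1|x) is used in place of the null proportion 1 - pi1, which it bounds
    from above, so the estimate is never smaller than (1 - pi1) / f(p|x).
    """
    return min(1.0, mixture_density(1.0, pi1, mu) / mixture_density(p, pi1, mu))
```

For example, with π₁ = 0.2 and µ = 5 the estimate is close to one for p-values near one and shrinks toward zero as p decreases, which is the ordering the AdaPT threshold update relies on.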

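There are two categories of missing values in these regression problems: H_i is never observed, and at each step t of the search, the p-values for tests {i: p_i ≤ s_t(x_i) or p_i ≥ 1 - s_t(x_i)} are masked as p̃_t,i = min(p_i, 1 - p_i). Naturally, an expectation-maximization (EM) algorithm can be used to estimate both π̂₁ and µ̂ by maximizing the partially observed likelihood. The complete log-likelihood for the conditional two-groups model is,

$$\ell(\pi_1, \mu; p, H, x) = \sum_{i=1}^{n} \left[ H_i \log \pi_1(x_i) + (1 - H_i) \log\left(1 - \pi_1(x_i)\right) \right] + \sum_{i=1}^{n} H_i \log\left( \frac{1}{\mu(x_i)}\, p_i^{1/\mu(x_i) - 1} \right).$$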

During the E-step of the d = 0, 1, … iteration of the EM algorithm, conditional on the partially observed data fixed at step t, , we compute both, where indicates how likely equals p_i for non-null hypotheses. The explicit calculations of and for both the revealed, , and masked p-values, , are available in the supplementary materials of (Lei & Fithian 2018).

The M-step consists of estimating and with separate gradient boosted trees, using pseudo-datasets to handle the partially masked data. In order to fit the model for π₁(x_i), we construct the response vector and use weights . Then we estimate using the first n predictions from a classification model using as the response variable with the covariate matrix (x_i)_i∈[n] replicated twice and weights . Similarly, for estimating we construct a response vector with weights , and again take the first n predicted values using the duplicated covariate matrix.

The conditional local fdr is estimated for each , and we follow the procedure detailed in Section 4.3 of Lei & Fithian (2018) to update the rejection threshold to s_t+1(x_i) by removing test from ℛ_t. A summary diagram of the EM algorithm is displayed in Figure 7.

Figure 7:

Summary of AdaPT EM algorithm.

AdaPT gradient boosted trees with CV steps

As a flexible approach for modeling the conditional local fdr, we use gradient boosted trees (Friedman 2001) via the open-source XGBoost implementation (Chen & Guestrin 2016). Gradient boosted trees are an ensemble of many “weak” learners, each of which contributes additively to the prediction. Let ℱ be the space of functions containing regression trees; then the sum-of-trees model can be written as,

$$\hat{y}_i = \sum_{p=1}^{P} f_p(x_i),$$

where each f_p ∈ ℱ is an individual tree and we aim to minimize the objective function,

$$\sum_{i=1}^{n} L(y_i, \hat{y}_i) + \sum_{p=1}^{P} \Omega(f_p),$$

where L is the loss function and Ω measures the complexity of each tree, such as the maximum depth, regularization, etc. Chen & Guestrin (2016) detail the algorithms for fitting the model in an additive manner as well as determining the splits for each tree.

In order to tune the variety of parameters for gradient boosted trees within AdaPT, such as the number of trees P and maximum depth of each tree, we use the cross-validation (CV) approach recommended in Lei & Fithian (2018). If we are considering M different options of boosting parameters, then we evaluate each of the M choices during the modeling phase of the AdaPT search. At step t, we divide the data into K folds preserving the relative proportions of masked and unmasked hypotheses. Then for each set of boosting parameters m = 1, …, M:

For each fold k = 1, …, K:
1. apply the EM algorithm from Figure 7 to estimate π̂₁ and µ̂ using parameters m with data from folds {1, …, K}\{k},
2. compute the expected log-likelihood on hold-out set k using the two-groups model parameters from m following convergence,
then compute the total across hold-out folds, ℓ_m = Σ_k ℓ_m,k.

Finally we use the set of parameters with the largest total hold-out log-likelihood, m* = argmax_m ℓ_m, in another instance of the EM algorithm to estimate π̂₁ and µ̂ on all data.
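The CV selection step can be sketched as follows. This is an illustrative outline only: `holdout_loglik` is a hypothetical user-supplied callable standing in for "fit the EM algorithm on the training folds with setting m, then return the expected log-likelihood on the hold-out fold"; it is not a function of the adaptMT package, and the fold assignment shown is one simple way to preserve the proportions of masked and unmasked hypotheses.

```python
import random


def stratified_folds(is_masked, n_folds, seed=0):
    """Assign hypotheses to folds so each fold keeps a similar share of masked tests."""
    rng = random.Random(seed)
    folds = [0] * len(is_masked)
    for status in (True, False):
        members = [i for i, m in enumerate(is_masked) if bool(m) == status]
        rng.shuffle(members)
        # Deal the shuffled indices round-robin across the folds.
        for position, i in enumerate(members):
            folds[i] = position % n_folds
    return folds


def select_settings(candidates, folds, n_folds, holdout_loglik):
    """Return the candidate with the largest total hold-out log-likelihood."""
    totals = {}
    for setting in candidates:
        total = 0.0
        for k in range(n_folds):
            train = [i for i, f in enumerate(folds) if f != k]
            held_out = [i for i, f in enumerate(folds) if f == k]
            total += holdout_loglik(setting, train, held_out)
        totals[setting] = total
    best = max(totals, key=totals.get)
    return best, totals
```

Here each candidate would be a tuple such as (P, D) for the number of trees and maximum depth, and the selected tuple is then used to refit both models on all data.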

Computational aspects of AdaPT

Practical decisions are necessary to implement the AdaPT search. In addition to the covariates and p-values (x_i, p_t,i)_i∈[n], an initial rejection threshold s₀(x_i) is required to begin the search. Rather than beginning the search with a high starting threshold, such as the s₀ = 0.45 recommended by Lei & Fithian (2018), we instead begin the AdaPT search with s₀ = 0.05. Our decision to lower the starting threshold is advantageous for multiple reasons. First, intuitively, this starts our search in the regime of interest for target level α = 0.05, where we would not expect to detect discoveries with larger p-values using this flexible multiple testing correction. Additionally, by lowering the starting threshold, more true information is available to the gradient boosted trees at the start of the AdaPT search. For instance, with the set of BrainVar eSNPs, 21,248 true p-values are immediately revealed with s₀ = 0.05 as compared to only 2,290 when s₀ = 0.45. Simulations detailed in SI Appendix show that on average our choice for using a lower threshold results in higher power (Figure S16).
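The effect of the starting threshold on how much information is revealed can be illustrated with a small sketch. It assumes a constant threshold s₀ across covariates and the AdaPT masking rule, in which a p-value is masked (reported only as min(p, 1 − p)) when p ≤ s₀ or p ≥ 1 − s₀; the toy p-values and the function name are made up for illustration.

```python
def mask_p_values(p_values, s0):
    """Apply the masking rule for a constant threshold s0.

    Returns the values visible to the model and a flag per test
    indicating whether the true p-value is revealed.
    """
    visible, revealed = [], []
    for p in p_values:
        is_masked = p <= s0 or p >= 1.0 - s0
        visible.append(min(p, 1.0 - p) if is_masked else p)
        revealed.append(not is_masked)
    return visible, revealed


toy_p = [0.01, 0.03, 0.2, 0.5, 0.7, 0.96, 0.99]
_, revealed_low = mask_p_values(toy_p, 0.05)
_, revealed_high = mask_p_values(toy_p, 0.45)
print(sum(revealed_low), sum(revealed_high))
```

With the lower threshold, every p-value strictly between 0.05 and 0.95 is revealed from the start, whereas the higher threshold reveals only those strictly between 0.45 and 0.55.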

The most computationally intensive part of the procedure is updating the rejection threshold via the EM algorithm. Instead of updating the model for estimating fdr_t,i at each step of the search, we re-estimate every [n/20] steps as recommended by Lei & Fithian (2018). However, the inclusion of the previously described K-fold CV procedure (we use K = 5) for tuning the gradient boosted trees obviously adds computational complexity to the AdaPT search, and would be expensive to apply every time the model fitting takes place. Rather, we apply the CV step once at the beginning, and then another time half-way through the search based on the similarity of simulation performance with varying number of CV steps in SI Appendix (Figure S17). Additionally, one needs to choose the potential M model parameter choices. Technically, unique combinations can be used for both models, π₁ and µ, but for simplicity we only consider matching settings for both models, i.e. both models have the same number of trees and maximum depth (see SI Appendix and Tables S1-S3). As a reminder, AdaPT guarantees finite-sample FDR control regardless of potentially over-fitting to the data when using the CV procedure. Simulations are provided in SI Appendix (Figure S21) showing how extensively increasing the number of trees P leads to decreasing power, but maintains valid FDR control.

We provide a modified version of the adaptMT R package to implement the AdaPT-CV tuning steps with XGBoost models at https://github.com/ryurko/adaptMT.

Supporting Information Text

S1 GTEx covariates for SCZ

As referred to in Data, for each of the n_GTEx eSNPs, we calculated for each r^cortical ∈ {GTEx cortical tissues}. Specifically, there are two cortical tissues available from GTEx: (1) anterior cingulate cortex BA24 and (2) frontal cortex BA9. GTEx additionally has duplicate tissue measurements for frontal cortex BA9, referred to as cortex. However, the cortex tissue samples were taken at the same time as the other non-brain tissue samples. Instead, we used the data corresponding to the frontal cortex BA9 tissues, since these samples were extracted at the same time as anterior cingulate cortex BA24 at the University of Miami Brain Endowment Bank and preserved by snap freezing (see GTEx FAQs).

In addition to calculating these two summaries, we also calculated an aggregate across both cortical tissues . When calculating the two individual cortical tissue sample and aggregate summaries, if eSNP i was not an eQTL for a particular tissue region (e.g. , then we impute a value of zero reflecting the lack of associated expression.

We applied WGCNA (Zhang & Horvath 2005, WGCNA) to GTEx data to create a set of module indicator variables, , denoting whether or not eSNP i has any cis-eQTL genes in module j_cortical for the WGCNA results from both cortical tissue samples, anterior cingulate cortex BA24 and frontal cortex BA9. To generate the WGCNA results, we only consider protein coding genes identified using the grex package in R (Xiao et al. 2018, R Core Team 2018). Additionally, all genes with expression levels of zero for over half of the provided samples were removed resulting in protein coding genes.

Unsigned network results were generated via the WGCNA package (Langfelder & Horvath 2008) with the default settings, using average linkage hierarchical clustering of the topological overlap dissimilarity matrix and the hybrid adaptive tree cut to generate the modules (Langfelder et al. 2008, Langfelder & Horvath 2012). Including the unassigned gray module, the WGCNA analysis of the cortical tissues yielded thirteen modules. Thus for each of the eSNPs, we created a vector of seventeen covariates to use in AdaPT via gradient boosted trees, using indicator variables denoting cis-eQTL membership for each of the WGCNA modules along with the summaries of expression association and BD z-statistics.

S2 SCZ variable importance and partial dependence

We explore further the resulting variable relationships from the final gradient boosted trees returned by AdaPT for the BrainVar eSNPs. First, Figure S1 displays the partial dependence plot of the effect size under the alternative on BD z-statistics, yielding a similar relationship to Figure 4(C) from Results. Figure S2 displays the partial dependence plots for each of the three BrainVar eQTL slope summaries based on the final gradient boosted trees returned by AdaPT for the BrainVar eSNPs. Figures S2(A-C) display the relationships for the probability of non-null model, while (D-F) display relationships for the effect size under the alternative. Although partial dependence plots suffer in high dimensions, we can still see general trends consistent with the variable importance plots from Figure 4(A-B), such as the stronger marginal relationship for the variable derived from the complete sample eQTL slopes.

Additionally, in Figure S3 we display the p-value distributions comparing the enrichment for membership in the different WGCNA modules reported by Werling et al. (2019). While many of the WGCNA modules lack clear evidence or contain too few eSNPs, as denoted by their respective y-axes, the cyan and salmon modules display noticeable enrichment. Additionally, as mentioned previously, membership in the gray module displays a lack of enrichment versus no associated cis-eQTL gene affiliated with the unassigned WGCNA module.

To contrast with the BrainVar eSNPs, Figures S4(A-B) display the resulting variable importance plots for the final models returned by AdaPT for the set of GTEx eSNPs. Similar to the BrainVar eSNPs, the BD z-statistics are the most important variables, with their respective partial dependence plots for both AdaPT models displayed in Figures S4(C-D). Unlike the BrainVar eSNPs, there is no clear enrichment displayed by either the eQTL slope summaries or the WGCNA module membership variables for the set of GTEx eSNPs.

S3 Replication simulations

We use simulations to empirically assess the observed nominal replication rate (the percentage of discoveries with p-values less than 0.05 in the holdout 2018-only studies) of 55.2% for the 843 SCZ discoveries from the 2014-only studies at target FDR level α = 0.05. We use the final non-null effect size model returned by AdaPT, , to generate simulated p-values p^sim and nominal replication rates against which to compare the observed rate. For the simulations, we assume that all 843 SCZ discoveries from the 2014-only studies are truly non-null, and we use the actual BrainVar eSNPs, their observed standard errors s₁₄, s₁₈ from the 2014-only and 2018-only studies respectively, as well as their actual covariates for generating p^sim. A single iteration of the simulation proceeds as follows:

Figure S1:

Partial dependence plot of non-null effect size on BD z-statistic. Vertical dashed lines denote z-statistics at +/- 1.96. Rugs along x-axis denote distribution of BD z-statistics.

Figure S2:

Partial dependence plots for probability of being non-null in (A-C), and the effect size under alternative in (D-F), for each type of BrainVar eQTL slope. Rugs along x-axis denote distribution of values for each variable.

For each of the R_BrainVar = 843 discoveries i ∈ℛ_BrainVar:
1. Assume test status is non-null: H_i = 1.
2. Generate effect size using final AdaPT model as truth:
3. Transform effect sizes to p-value .
4. Convert simulated p-value to z-statistic .
5. Calculate updated z-statistic to reflect observed reduction in standard error for 2018-only studies relative to 2014-only,
6. Convert updated z-statistic to p-value:
Calculate nominal replication rate using ,
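One iteration of this replication simulation might be sketched as below. Several details are assumptions made for illustration rather than the authors' exact procedure: effect sizes -log(p) are drawn from an exponential distribution with mean µ̂(x_i) (the distribution implied by the Beta(1/µ, 1) non-null model), p-values are treated as two-sided, and the z-statistic is rescaled by the ratio s₁₄/s₁₈ to reflect the smaller standard error. The function name and arguments are hypothetical.

```python
import math
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def simulate_replication_rate(mu_hat, se_2014, se_2018, rng, alpha=0.05):
    """Simulate one nominal replication rate, treating every discovery as non-null."""
    replicated = 0
    for mu, s14, s18 in zip(mu_hat, se_2014, se_2018):
        effect = rng.expovariate(1.0 / mu)          # -log(p), mean mu (assumed)
        p_sim = max(math.exp(-effect), 1e-300)      # guard against underflow
        z_sim = -_STD_NORMAL.inv_cdf(p_sim / 2.0)   # two-sided p-value -> |z|
        z_new = z_sim * s14 / s18                   # assumed standard error rescaling
        p_new = 2.0 * _STD_NORMAL.cdf(-z_new)       # back to a two-sided p-value
        replicated += p_new < alpha
    return replicated / len(mu_hat)
```

Repeating the call with fresh random draws gives a distribution of simulated nominal replication rates that can be compared with the observed rate.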

Figure S3:

Comparison of SCZ p-value distributions from 2014 studies by whether or not the eSNP had an associated cis-eQTL gene in the module.

Figure S4:

Using GTEx eSNPs: variable importance plots for final AdaPT (A) probability of being non-null and (B) effect size under alternative models, as well as partial dependence plots for (C) probability of being non-null and (D) the effect size under alternative for BD z-statistics.

We repeat this process to generate one-thousand simulated values for the nominal replication rate. The distribution of the simulated values ranges from approximately 51% to 64%, with an average and median of ∼ 57%, close to the observed rate of 55.2%. Assuming that all of the 843 rejections are truly non-null is clearly an overly optimistic assumption given the use of FDR error control. Thus, the average simulated nominal replication rate of ∼ 57.2% is reassuringly close to the observed rate and likely higher than what would be expected if false discoveries were accounted for among the 843 considered eSNPs.

S4 SCZ results with all 2018 studies

We generate the AdaPT results by applying the SCZ p-values from the all-2018 studies to the same set of n_BrainVar = 26,076 eSNPs with the same covariates. As a comparison to the results displayed in Figure 3 using the 2014-only studies, Figures S5(A-D) display the same figures but with the results from all 2018 studies at target FDR level α = 0.05. In contrast to before, because of the increase in power from the larger study size, modeling the auxiliary information provides a much smaller gain, with an approximately 19% increase in discoveries from the intercept-only results (1,865 discoveries) to using all twenty-four covariates (2,228 discoveries).

We additionally examine for comparison the variable importance and partial dependence plots from the final gradient boosted models returned by AdaPT using all 2018 studies. Similar to before, Figures S6(A-B) display the variable importance plots for the probability of being non-null and the effect size under alternative models, respectively, using the SCZ p-values from all 2018 studies. The results are similar to before, but with the complete sample BrainVar eQTL slopes possessing the highest importance. The BD z-statistics are again highly important for all 2018 studies, displaying a similarly increasing relationship for both final AdaPT models, as seen in the partial dependence plots in Figures S6(C-D). The partial dependence plots for the different BrainVar eQTL slope summaries are seen in Figures S7(A-D). Figure S8 displays the levels of SCZ enrichment for all 2018 studies, revealing modules that are consistent with the 2014-only studies, such as cyan and salmon.

Figure S5:

Manhattan plots of SCZ AdaPT discoveries with all 2018 studies using (A) intercept-only model compared to (B) covariate informed model at target α = 0.05. (C) Comparison of the number of discoveries at target α = 0.05 for AdaPT with varying levels of covariates and (D) their resulting discovery intersections.

Figure S6:

Variable importance plots for final AdaPT (A) probability of non-null and (B) effect size under alternative models. Partial dependence plot for both (C) probability of being non-null and (D) effect size under alternative with BD z-statistics.

S5 Type 2 diabetes results

Using GWAS summary statistics for type 2 diabetes (T2D), unadjusted for BMI, available from Diabetes Genetics Replication And Meta-analysis (DIAGRAM) consortium (Mahajan et al. 2018), we applied our full pipeline outlined in Figure 1. Of the initial set of over twenty-three million SNPs available, we identified 176,246 eSNPs from eQTL variant-gene pairs from any GTEx tissue sample using the same definition as the GTEx eSNPs considered for the SCZ GWAS explained in Data. Figure S9 displays the enrichment for these GTEx eSNPs compared to the original set of SNPs from the T2D GWAS results.

Figure S7:

We create a vector of covariates summarizing expression level information from GTEx for pancreas, liver, and two adipose tissues, subcutaneous and visceral (omentum). Specifically, we calculate for each r^T2D in the set of tissues: pancreas, liver, adipose - subcutaneous, adipose - visceral (omentum). Additionally, we generate WGCNA module assignments using protein coding genes for pancreas samples from GTEx (using the same settings described in GTEx covariates for SCZ), resulting in fourteen different modules (including the unassigned gray module). Unlike the SCZ applications, we do not use independent GWAS results from another phenotype.

Using the covariates defined above, we applied AdaPT to the 176,246 GTEx eSNPs. However, we encountered an issue with these data: we were unable to discover any hypotheses at target FDR level α ≤ 0.05. This was because 640 eSNPs had p-values exactly equal to one. While this can understandably occur with publicly available GWAS summary statistics, p-values equal to one will always contribute to the pseudo-estimate for the number of false discoveries A_t during the AdaPT search (see Methodology overview). With a relatively high number of p-values equal to one, AdaPT is unable to search through rejection sets for lower α values. To overcome this challenge, we draw random replacement p-values for the 640 eSNPs from a uniform distribution between 0.97 and 1 - 10⁻¹⁵, a value strictly less than one, to allow some leeway. We refer to this set of p-values as adjusted, while the original observed p-values are unadjusted. For comparison, Figure S10 shows the difference in the number of discoveries for the adjusted and unadjusted p-values across different target α values. Due to the similarity in performance for α values greater than 0.1, we use results for the adjusted p-values moving forward.
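The adjustment and its effect on the AdaPT false discovery proportion estimate can be sketched as follows. The estimate used here, (1 + A_t) / max(R_t, 1) with A_t the number of p-values at or above 1 − s and R_t the number at or below s, is the form given by Lei & Fithian (2018); the constant threshold s, the toy data, and the function names are assumptions for illustration.

```python
import random


def adjust_unit_p_values(p_values, rng, low=0.97, high=1.0 - 1e-15):
    """Replace p-values equal to one with uniform draws strictly below one."""
    return [rng.uniform(low, high) if p >= 1.0 else p for p in p_values]


def fdp_estimate(p_values, s):
    """AdaPT-style FDP estimate (1 + A_t) / max(R_t, 1) for a constant threshold s."""
    a_t = sum(p >= 1.0 - s for p in p_values)   # mirror-region count
    r_t = sum(p <= s for p in p_values)         # candidate rejections
    return (1 + a_t) / max(r_t, 1)
```

A p-value exactly equal to one satisfies p ≥ 1 − s for every threshold s, so it is counted in A_t at every step of the search; after the adjustment it drops out of A_t once the threshold falls below one minus its replacement value.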

Figure S8:

Comparison of SCZ p-value distributions from all-2018 studies by whether or not the eSNP had an associated cis-eQTL gene in the module.

Figure S9:

A comparison of qq-plots revealing T2D enrichment for GTEx eSNPs compared to full set of SNPs.

Figure S10:

Comparison of the number of discoveries by AdaPT for T2D by whether or not the adjusted or unadjusted p-values were used.

At target FDR level α = 0.05, AdaPT yields 14,920 T2D discoveries using the adjusted p-values with covariates (compared to 14,693 intercept-only discoveries). The variable importance plots for the final T2D AdaPT models are displayed in Figure S11. This set of eSNPs is associated with 5,970 cis-eQTL genes, to which we then applied gene ontology enrichment analysis (Ashburner et al. 2000, The Gene Ontology Consortium 2018), identifying the enrichment for biological processes displayed in Figure 6.

Figure S11:

Variable importance plots for final T2D AdaPT (A) probability of being non-null and (B) effect size under alternative models.

S6 BMI results

We also applied our analysis pipeline to BMI, unadjusted for waist-to-hip ratio (WHR), using GWAS results for individuals of European ancestry available from the GIANT Consortium. Specifically, we approached BMI in the same manner as SCZ: apply AdaPT to GWAS results from earlier studies with a sample size of 322,154 individuals (Locke et al. 2015); then compare the nominal replication results on recently conducted studies with a sample size of approximately 700,000 individuals (Yengo et al. 2018). As before, all of the 2015-only studies from Locke et al. (2015) were included as a subset of all 2018 studies (Yengo et al. 2018). Because both Locke et al. (2015) and Yengo et al. (2018) use the inverse variance-weighted fixed effects approach for meta-analysis, we can then compute statistics for the 2018-only studies, those exclusive to Yengo et al. (2018). Additionally, to make this example more comparable to the SCZ application, we also use GWAS results for WHR (Shungin et al. 2015) as a covariate (analogous to BD for SCZ). Following pre-processing steps (matching SNPs across studies and effect alleles in both WHR and BMI), we identified 47,690 GTEx eSNPs from a set of nearly two million SNPs, based on the definition explained in Data. Figure S12 displays the enrichment for the GTEx eSNPs compared to the original set of pre-processed SNPs for the 2015-only studies.
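The recovery of 2018-only statistics from two nested fixed effects meta-analyses can be illustrated with a short sketch. It assumes the standard inverse variance-weighted identities (the combined weight is the sum of the component weights, and the combined estimate is the weighted mean), which may differ in detail from the computation actually used; the function name and inputs are hypothetical.

```python
import math


def subtract_fixed_effects(beta_all, se_all, beta_sub, se_sub):
    """Recover the estimate for studies in 'all' but not in 'sub'.

    Under inverse variance weighting, w_all = w_sub + w_rest and
    w_all * beta_all = w_sub * beta_sub + w_rest * beta_rest.
    """
    w_all = 1.0 / se_all ** 2
    w_sub = 1.0 / se_sub ** 2
    w_rest = w_all - w_sub
    if w_rest <= 0:
        raise ValueError("the full analysis must be more precise than the subset")
    beta_rest = (w_all * beta_all - w_sub * beta_sub) / w_rest
    se_rest = math.sqrt(1.0 / w_rest)
    return beta_rest, se_rest, beta_rest / se_rest
```

The third returned value is the z-statistic for the remaining studies, from which a p-value for nominal replication could be computed.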

Figure S12:

Comparison of qq-plots revealing BMI enrichment for GTEx eSNPs compared to full set of SNPs.

Based on previous knowledge of BMI tissue expression associations (Locke et al. 2015), we create a vector of covariates summarizing expression level information from GTEx for brain and adipose tissues (both subcutaneous and visceral (omentum)). Specifically, we calculate for each r^BMI ∈ {GTEx brain tissues, adipose - subcutaneous, adipose - visceral (omentum)}, where we consider the following brain tissues: (1) amygdala, (2) anterior cingulate cortex BA24, (3) caudate basal ganglia, (4) cerebellar hemisphere, (5) frontal cortex BA9, (6) hippocampus, (7) hypothalamus, (8) nucleus accumbens basal ganglia, (9) putamen basal ganglia, (10) spinal cord cervical c-1, and (11) substantia nigra. We do not consider the available cerebellum and cortex tissue samples from GTEx, as these are duplicates of cerebellar hemisphere and frontal cortex BA9 respectively. We instead only use the samples taken at the same time as the other brain sub-regions at the University of Miami Brain Endowment Bank, preserved by snap freezing (see GTEx FAQs).

We also created an aggregate across , all cis-eQTL genes associated with eSNP i for each non-cerebellar hemisphere brain tissue region r^nc,

We did not include the cerebellum tissue samples in this aggregate due to the reported distinctness of the cerebellum relative to other brain tissue samples (GTEx Consortium 2015). Similarly, we computed an average across the two adipose tissues. As before, when calculating the various eQTL slopes summaries, if eSNP i was not an eQTL for a particular region then we impute a value of zero reflecting the lack of associated expression.

Furthermore, WGCNA module assignments were generated using protein coding genes for three different sets of tissues: (1) all non-cerebellar hemisphere brain tissues, (2) cerebellar hemisphere only tissue, and (3) adipose tissues (using same settings described previously in GTEx covariates for SCZ). Together with the WHR z-statistics and covariates accounting for the associations and WGCNA module indicators, contained 110 variables.

For the BMI eSNPs, 376 have p-values exactly equal to one, leading to the same problem as we encountered in the T2D analysis. Again, we proceed by randomly drawing replacement p-values for these 376 eSNPs from a uniform distribution between 0.97 and 1 - 10⁻¹⁵. Figure S13 shows how AdaPT fails to obtain any discoveries across the various α levels without making an adjustment to the p-values. With this limitation recognized, we proceed to focus on the discoveries returned by AdaPT using the adjusted p-values at α = 0.05.

Figure S13:

Comparison of the number of discoveries by AdaPT for BMI by whether or not the adjusted or unadjusted p-values were used.

Figure S14:

Black line displays smooth relationship between BMI p-values from 2018-only studies and the AdaPT q-values from the 2015-only studies. Blue-shaded region indicates AdaPT discoveries at α = 0.05 that are nominal replications, p-values from the 2018-only studies < 0.05 while red denotes discoveries which failed to replicate.

Unlike SCZ and T2D, AdaPT using all of the covariates detected fewer discoveries: 1,383 eSNPs compared to 1,624 eSNPs discovered by the intercept-only AdaPT model at target FDR level α = 0.05. Of these 1,383 discoveries, approximately 83% (1,140 eSNPs) were nominal replications with p-values less than or equal to 0.05 in the independent 2018-only studies. Figure S14 displays the increasing smoothing spline relationship between the 2018-only p-values and the resulting 2015-only q-values from the AdaPT search on the log₁₀ scale. The much higher observed nominal replication rate is not surprising given the well powered size of the BMI studies, as indicated by the y-axis of Figure S14, which reflects the level of enrichment for the 2018-only studies.

Additionally, gene ontology enrichment analysis for the 1,383 discoveries using all covariates revealed no significant biological process enrichment at target FDR level α = 0.05. One concern is that a model with 110 variables is excessive, because the variable importance plots for the final BMI AdaPT models in Figures S15(A-B), along with the partial dependence plots in Figures S15(C-D), emphasize the relative importance of the WHR z-statistics compared to other covariates. To test this conjecture, we explored two simpler models using (1) WHR z-statistics only and (2) WHR z-statistics with eQTL slope summaries. These produced 1,324 and 1,332 discoveries at the 0.05 level, respectively. We conclude that the available covariates do not provide sufficient additional information beyond the signal available with this immense sample and consequently including covariates in the AdaPT model does not increase the power of the procedure.

Figure S15:

Variable importance plots for final BMI AdaPT (A) probability of being non-null and (B) effect size under alternative models. Partial dependence plot for both (C) probability of being non-null and (D) effect size under alternative with WHR z-statistics.

S7 CV tuning for SCZ, T2D, and BMI results

Rather than fixing the parameter settings for the XGBoost gradient boosted trees, we use the CV algorithm (detailed in Methods) at two steps of the search to tune the models (see the following section for justification of using two CV steps). For our search space, we evaluate a small range of values for the number of trees P and limit the maximum tree depth D to result in reasonably shallow trees (referred to as nrounds and max_depth in the xgboost package (Chen et al. 2019)).

First, when exploring the improvement in discovery rate for the BrainVar eSNPs by incrementally including more information, we used the following XGBoost settings:

BD z-stats: Fixed D = 1, varied P ∈ {50, 100, 150},
BD z-stats + eQTL slopes: Combinations of P ∈ {100, 150}, D ∈ {1, 2},
BD z-stats + eQTL slopes + WGCNA: Combinations of P ∈ {100, 150}, D ∈ {2, 3},
WGCNA only: Combinations of P ∈ {100, 150}, D ∈ {1, 2, 3}.

We explored different settings for the different possible covariates to address the types of variables included. For instance, when using the BD z-statistics only, we considered single-split “stumps” to model the BD z-statistics relationships in purely an additive sense because it is a univariate example. Once we have all three types of covariates (BD z-statistics, eQTL slope summaries, and WGCNA results), we limit the maximum depth to be at least two to ensure possible interactions can be captured.

The selected number of trees P and maximum depth D for each of these sets of covariates are displayed in Table S1. For each set of covariates, the most complex settings (largest number of trees and largest depth) are selected in both CV steps. This agreement in selection is not surprising given the choice of the low starting threshold s₀ = 0.05, which differs from the results displayed in Table S3 of the next section using s₀ = 0.45. We evaluated the same possible settings for the all-2018 results displayed in Figures S5(C-D): the same choices for P and D displayed in Table S1 were selected in both CV steps.

For the SCZ results with the GTEx eSNPs using all covariates , as well as the results for T2D and BMI with their full set of covariates, we evaluated four combinations: (1) P = 100, D = 2, (2) P = 150, D = 2, (3) P = 100, D = 3, and (4) P = 150, D = 3. For the BMI results using only WHR z-statistics, we fixed D = 1 and varied P ∈ {50, 100, 150}; for the results using WHR z-statistics with the eQTL slopes, we used combinations of P ∈ {100, 150}, D ∈ {1, 2}. The selected number of trees P and maximum depth D for each of these sets of AdaPT results at both CV steps is displayed in Table S2.

Table S1:

Selected boosting settings for number of trees P and maximum depth D with AdaPT CV algorithm by covariates for BrainVar eSNPs in each CV step.

Table S2:

Selected boosting settings for number of trees P and maximum depth D with AdaPT CV algorithm by GWAS results in each CV step.

S8 Selection of s₀ and number of CV steps

To justify the selection of both the starting threshold s₀ and number of CV steps for the AdaPT search, we generated simulations from the final AdaPT models returned from the SCZ 2014-only results. While these models are based on AdaPT results with a starting threshold of s₀ = 0.05 and the use of two CV steps, they are only from the final model and are not explicitly parametrized by s₀ and the number of CV steps. We know, however, that these final models are the result of using P = 150 trees with a maximum depth of D = 3, as indicated in Table S1 of the previous section.

Let and be the final models for the probability of non-null and effect size under the alternative that AdaPT returns for the BrainVar eSNPs using all covariates . We use these models as the “truth” for generating data, in which a single iteration of the simulation proceeds as follows:

For each BrainVar eSNP
1. Generate test status: .
2. Generate simulated effect sizes:
3. Transform to p-values p_i.
Apply AdaPT to simulated study p-values with specified s₀ and v CV steps with two candidate settings:
1. number of trees P = 100 and maximum depth D = 2,
2. number of trees P = 150 and maximum depth D = 3.
Compute observed power and FDP at range of target FDR α values.
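The data-generating and evaluation steps of this simulation might be sketched as below. The distributional choices are assumptions for illustration: test status is Bernoulli with probability π̂₁(x_i), non-null effect sizes -log(p) are exponential with mean µ̂(x_i) (as implied by the Beta(1/µ, 1) model), and null p-values are uniform. The AdaPT search itself is not shown, and the function names are hypothetical.

```python
import math
import random


def simulate_p_values(pi1_hat, mu_hat, rng):
    """Draw test statuses and p-values from fitted two-groups model values."""
    status, p_values = [], []
    for pi1, mu in zip(pi1_hat, mu_hat):
        is_non_null = 1 if rng.random() < pi1 else 0
        if is_non_null:
            # -log(p) ~ Exponential(mean mu) under the assumed non-null model.
            p = math.exp(-rng.expovariate(1.0 / mu))
        else:
            p = rng.random()  # uniform null p-value
        status.append(is_non_null)
        p_values.append(p)
    return status, p_values


def power_and_fdp(status, rejected):
    """Observed power and false discovery proportion for one rejection set."""
    true_pos = sum(1 for h, r in zip(status, rejected) if h and r)
    n_rejected = sum(1 for r in rejected if r)
    power = true_pos / max(sum(status), 1)
    fdp = (n_rejected - true_pos) / max(n_rejected, 1)
    return power, fdp
```

Averaging the returned power and FDP over repeated simulations, for each combination of s₀ and number of CV steps, gives the quantities compared across settings.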

We generate one-hundred simulations this way for each possible threshold s₀ ∈ {0.05, 0.25, 0.45} and v ∈ {1, 2, 5} CV steps. Figure S16 displays the average difference in power between the different starting threshold values by the number of CV steps. Although the differences are small, we see that using s₀ = 0.05 results in higher power, on average, than both 0.25 and the recommended 0.45 value. Using this low starting threshold of s₀ = 0.05, we then directly compute the difference in power between the different number of CV steps displayed in Figure S17. Unsurprisingly, while again the differences are small, only one CV step results in the lowest power, on average. Since the computational cost of AdaPT with CV tuning is reduced by only using two CV steps instead of a higher number, such as five, and the simulations demonstrate on average no difference in power at both α values of 0.05 and 0.10, we use the starting threshold of s₀ = 0.05 with two CV steps in our applications of AdaPT.

In the previous section, Table S1 displayed the selections in both CV steps with s₀ = 0.05. For comparison, Table S3 displays the selections using s₀ = 0.45. Instead of selecting the same settings in both steps, the higher initial threshold selects the least complex settings (smallest number of trees and minimum depth) in the first CV step before flipping to the most complex settings in the second step. Intuitively, the higher initial threshold means more information is masked from the models, so it is not surprising to see less complex settings chosen. This further reinforces the use of the lower initial threshold s₀ = 0.05: it starts with more revealed information and selects model settings corresponding to improved CV performance for tests with lower p-values of interest.

Table S3:

Selected boosting settings for number of trees P and maximum depth D with AdaPT CV algorithm by covariates for BrainVar eSNPs with s₀ = 0.45.

Figure S16:

Difference in simulation power between different initial thresholds s₀ for AdaPT search by number of CV steps. Points denote averages with plus/minus two standard error bars.

Figure S17:

Difference in simulation power between the number of CV steps with s₀ = 0.05. Points denote averages with plus/minus two standard error bars.

S9 Dependent p-value block simulation

To demonstrate the performance of AdaPT in the presence of dependent tests, we construct simulations with a block-correlation scheme to emulate LD structure for SNPs. We consider a setting with two independent covariates, x_i = (x_i1, x_i2).

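For each test i ∈ [n], we define a linear relationship for the log-odds of being non-null using these covariates,

\[ \log\left( \frac{\pi_{1,i}(x_i)}{1 - \pi_{1,i}(x_i)} \right) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}. \]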

Then, the resulting status of the test H_i is a Bernoulli random variable based on the probability π_1,i(x_i) where H_i = 1 indicates the test i is non-null while H_i = 0 indicates a true null,

\[ H_i \sim \text{Bernoulli}\left( \pi_{1,i}(x_i) \right). \]
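The mapping from covariates to test status can be sketched in a few lines of Python. The covariate distribution below (independent Uniform(0, 1) draws) is an assumption made only so the example runs. The function name is hypothetical. The defaults β₀ = -3 and β₁ = β₂ = 1 are one of the settings considered in these simulations.

```python
import math
import random


def non_null_probability(x1, x2, beta0=-3.0, beta1=1.0, beta2=1.0):
    """Inverse-logit of the linear log-odds beta0 + beta1*x1 + beta2*x2."""
    log_odds = beta0 + beta1 * x1 + beta2 * x2
    return 1.0 / (1.0 + math.exp(-log_odds))


rng = random.Random(2019)  # fixed seed so the sketch is reproducible
n = 1000

# Assumption: covariates drawn independently from Uniform(0, 1).
covariates = [(rng.random(), rng.random()) for _ in range(n)]
probs = [non_null_probability(x1, x2) for x1, x2 in covariates]

# H_i = 1 marks a non-null test, H_i = 0 a true null.
status = [1 if rng.random() < p else 0 for p in probs]
```

With β₀ = -3, a test whose covariates are both zero has a non-null probability of about 0.047, so larger covariate values are what raise the chance of a signal.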

Given this test status, a vector of true effect sizes µ = (µ_1, …, µ_n) is also generated as a function of the covariates,

To simulate observed effect sizes, we construct an n × n covariance matrix Σ with B blocks of equal size n/B. Each block b ∈ [B] has constant correlation ρ between all tests within the block, while the blocks are independent of each other. This results in constructing individual block covariance matrices, Σ_b, with ones along the diagonal and ρ for the off-diagonal elements. Each of these individual matrices is placed along the diagonal of Σ, with the remaining off-diagonal elements set to zero so that blocks are independent of each other. As an example, if each block contained only two tests they would be constructed in the following manner,

\[ \Sigma_b = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_1 & 0 & \cdots & 0 \\ 0 & \Sigma_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \Sigma_B \end{pmatrix}. \]

Using this block-wise construction of the covariance matrix, we then proceed to generate the vector of observed effect sizes z = (z_1, …, z_n) from a multivariate Gaussian distribution,

\[ z \sim \mathcal{N}(\mu, \Sigma). \]
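One way to draw from this block-correlated Gaussian without building the full n × n matrix is a one-factor representation. The sketch below uses that substitute technique rather than the explicit covariance construction described above, and the function name is hypothetical. It relies on the fact that z_i = µ_i + sqrt(ρ) w_b + sqrt(1 - ρ) e_i, with w_b shared within block b and all terms standard normal, has unit variance and within-block correlation ρ for 0 ≤ ρ ≤ 1.

```python
import math
import random


def sample_block_correlated(mu, n_blocks, rho, rng):
    """Draw z ~ N(mu, Sigma) for block-diagonal Sigma with unit variances
    and constant within-block correlation rho (0 <= rho <= 1)."""
    n = len(mu)
    block_size = n // n_blocks  # assumes n is divisible by n_blocks
    z = []
    for b in range(n_blocks):
        shared = rng.gauss(0.0, 1.0)  # factor common to every test in block b
        for i in range(b * block_size, (b + 1) * block_size):
            noise = rng.gauss(0.0, 1.0)  # test-specific noise
            z.append(mu[i] + math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * noise)
    return z


rng = random.Random(1)
mu = [0.0] * 10_000  # all-null effect sizes, for illustration only
z = sample_block_correlated(mu, n_blocks=500, rho=0.5, rng=rng)
```

Setting `rho=1.0` makes every test in a block share the same noise draw, which corresponds to the perfect-correlation setting.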

We compute the resulting two-sided p-value p_i = 2 · Φ(-|z_i|) for each test’s observed effect size, where Φ is the standard normal CDF.
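The two-sided p-value can be computed with the standard library alone, assuming the CDF in the expression is the standard normal CDF. The sketch uses the identity 2 · Φ(-a) = erfc(a / sqrt(2)) for a ≥ 0; the function name is hypothetical.

```python
import math


def two_sided_p_value(z):
    """Two-sided p-value 2 * Phi(-|z|) for a standard normal test statistic."""
    # Phi(-a) = 0.5 * erfc(a / sqrt(2)), so 2 * Phi(-|z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


p_values = [two_sided_p_value(z) for z in (0.0, 1.96, -1.96, 3.0)]
```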

For each dataset generated using the process above, we compute both the observed FDP and power for the classical BH procedure and two different versions of AdaPT:

intercept-only,
gradient boosted trees with covariates: x_i = (x_i1, x_i2).
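For reference, the BH baseline in this comparison is a step-up rule that can be written compactly. The following is a minimal Python sketch of the standard procedure, not the implementation used for the reported results; the function name and example p-values are illustrative.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of bools marking the hypotheses rejected by BH."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k with p_(k) <= alpha * k / n.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / n:
            k_max = rank
    # Reject the k_max smallest p-values (possibly none).
    rejected = [False] * n
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# The two smallest p-values miss their own thresholds, but the third
# passes, so the step-up rule rejects all three.
rejected = benjamini_hochberg([0.02, 0.03, 0.035, 0.9], alpha=0.05)
```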

We fix both n = 10,000 and B = 500 blocks, resulting in 500 blocks of twenty tests each. Rather than force all non-nulls together in the same blocks, we first calculate the minimum number of blocks required to hold all non-null tests, B_A = ⌈|{i: H_i = 1}| / (n/B)⌉. The non-null tests are then randomly assigned to blocks, ensuring that there will be blocks containing both null and non-null tests. The |{i: H_i = 0}| null tests are randomly assigned to available spots within the B_A blocks as well as the remaining 500 - B_A strictly null blocks.
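The block assignment can be sketched as follows. This reflects one reading of the description, in which non-null tests take random spots inside the first B_A blocks and null tests fill everything else; the exact randomization used for the reported simulations may differ, and the function name is hypothetical.

```python
import math
import random


def assign_tests_to_blocks(status, n_blocks, rng):
    """Return (position, b_a), where position[slot] is the index of the test
    placed in that slot; slots 0..block_size-1 form block 0, and so on."""
    n = len(status)
    block_size = n // n_blocks  # assumes n is divisible by n_blocks
    non_null = [i for i, h in enumerate(status) if h == 1]
    null = [i for i, h in enumerate(status) if h == 0]
    # Minimum number of blocks able to hold every non-null test.
    b_a = math.ceil(len(non_null) / block_size)
    mixed_slots = list(range(b_a * block_size))
    rng.shuffle(mixed_slots)
    rng.shuffle(null)
    position = [None] * n
    # Non-null tests take random slots inside the first B_A blocks.
    for slot, test in zip(mixed_slots, non_null):
        position[slot] = test
    # Null tests fill the leftover slots there and all strictly null blocks.
    free_slots = [s for s in range(n) if position[s] is None]
    for slot, test in zip(free_slots, null):
        position[slot] = test
    return position, b_a


# Hypothetical example: 437 non-null tests among n = 10,000.
status = [1] * 437 + [0] * 9563
position, b_a = assign_tests_to_blocks(status, n_blocks=500, rng=random.Random(7))
```

With twenty tests per block, 437 non-null tests need 22 blocks, leaving 478 strictly null blocks.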

In our simulations, we fix β₀ = -3 and require that both β₁ = β₂ and γ₁ = γ₂. We vary the following settings in our simulations:

block correlation ρ ∈ {0, 0.25, 0.5, 0.75, 1} where each block has the same value for ρ,
β₁, β₂ ∈ {1, 2, 3},
µ_floor ∈ {0.5, 1, 1.5},
γ₁, γ₂ ∈ {0.5, 0.75, 1}.
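Because β₁ = β₂ and γ₁ = γ₂, the varied settings form a grid over four quantities. Assuming the values are fully crossed, as the row, column, and x-axis layout of Figures S18 to S20 suggests, the grid can be enumerated with a short Python sketch (the dictionary keys are illustrative names):

```python
from itertools import product

rhos = [0, 0.25, 0.5, 0.75, 1]   # block correlation
betas = [1, 2, 3]                # beta_1 = beta_2
mu_floors = [0.5, 1, 1.5]
gammas = [0.5, 0.75, 1]          # gamma_1 = gamma_2

# One dictionary per simulation setting in the assumed full factorial design.
settings = [
    {"rho": r, "beta": b, "mu_floor": m, "gamma": g}
    for r, b, m, g in product(rhos, betas, mu_floors, gammas)
]
```

This gives 5 × 3 × 3 × 3 = 135 settings.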

We generate 100 simulations using the data generating process above, computing both the FDP and power for BH and the two different versions of AdaPT. For the covariate-informed version of AdaPT, we use gradient boosted trees via XGBoost with P = 100 trees and maximum depth D = 1. For both versions of AdaPT, we start with the initial threshold of s₀ = 0.45 and update the model ten times throughout the search (rather than the recommended twenty, to reduce computation time).

Figures S18, S19, and S20 display points for the average observed FDP and power across the 100 simulations with plus/minus two standard error bars for µ_floor = 0.5, 1, and 1.5, respectively, with target FDR level α = 0.05. The columns in each figure correspond to the different values considered for γ₁ = γ₂, while the rows correspond to β₁ = β₂. The x-axis of each figure displays the increasing block correlation ρ. Regardless of the simulation setting, we see that AdaPT, when accounting for the covariates (x_i1, x_i2), maintains valid FDR control at 0.05, similar to BH. This holds in the settings with greater effect sizes, as well as when the covariate information displays the best performance in terms of observed power (the bottom right panels of each figure). We can see that the intercept-only approach fails to achieve FDR control under block settings with perfect correlation, while the use of covariate information appears to inhibit such behavior. Our focus on positive correlation values reflects the setting faced in genomics regarding LD structure. Further exploration of AdaPT’s performance in settings with arbitrary dependence structure presents an opportunity for future work, as does accounting for covariate information that predicts the observed correlated noise.

S10 Simulations demonstrating effects of overfitting

Flexible methods like gradient boosted trees can overfit, especially on small data sets, which could raise concerns about their incorporation in AdaPT. To assess the effects of overfitting the gradient boosted trees in AdaPT, we constructed simulated datasets using the final models returned by AdaPT on the SCZ GWAS results, π̂_1 and µ̂ (for the probability of being non-null and the effect size under the alternative, respectively), with the actual covariates for each of the eSNPs. We then simulated data using these models in the same manner previously explained for choosing s₀ and the number of CV steps, and computed the observed power and FDP over a range of number of trees P ∈ {100, 300, 500, 700, 900}.

Figure S21(A) displays the distributions for fifty simulations of the observed FDP as the number of trees in the gradient boosted model increases. Regardless of the number of trees, we still maintain valid FDR control. However, Figure S21(B) shows that as the number of trees increases, the method overfits, resulting in a reduction in power. This reinforces that, although good model tuning can be important for power, the AdaPT method continues to maintain FDR control even as the model breaks down.

Figure S18:

Comparison of average (A) FDP and (B) power with plus/minus two standard error bars for 100 simulations with µ_floor = 0.5, and varying values for β₁ (rows) and γ₁ (columns) and block correlation ρ.

Figure S19:

Comparison of average (A) FDP and (B) power with plus/minus two standard error bars for 100 simulations with µ_floor = 1, and varying values for β₁ (rows) and γ₁ (columns) and block correlation ρ.

Figure S20:

Comparison of average (A) FDP and (B) power with plus/minus two standard error bars for 100 simulations with µ_floor = 1.5, and varying values for β₁ (rows) and γ₁ (columns) and block correlation ρ.

Figure S21:

Distributions of observed (A) FDP and (B) power for simulations as the number of AdaPT gradient boosted trees increases by target FDR level α. Points denote averages with plus/minus two standard error intervals.

Footnotes

https://github.com/ryurko/adaptMT

References

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M. & Sherlock, G. (2000), ‘Gene ontology: tool for the unification of biology’, Nature Genetics 25(1), 25–29. URL: https://doi.org/10.1038/75556
Benjamini, Y. & Hochberg, Y. (1995), ‘Controlling the false discovery rate: A practical and powerful approach to multiple testing’, Journal of the Royal Statistical Society. Series B (Methodological) 57(1), 289–300. URL: http://www.jstor.org/stable/2346101
Boca, S. M. & Leek, J. T. (2018), ‘A direct approach to estimating false discovery rates conditional on covariates’, PeerJ 6, e6035. URL: https://doi.org/10.7717/peerj.6035
Boyle, E. A., Li, Y. I. & Pritchard, J. K. (2017), ‘An expanded view of complex traits: From polygenic to omnigenic’, Cell 169(7), 1177–1186. URL: https://doi.org/10.1016/j.cell.2017.05.038
Chen, T. & Guestrin, C. (2016), XGBoost: A scalable tree boosting system, in ‘Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining’, KDD ’16, ACM, New York, NY, USA, pp. 785–794. URL: http://doi.acm.org/10.1145/2939672.2939785
Chen, T., He, T., Benesty, M., Khotilovich, V., Tang, Y., Cho, H., Chen, K., Mitchell, R., Cano, I., Zhou, T., Li, M., Xie, J., Lin, M., Geng, Y. & Li, Y. (2019), xgboost: Extreme Gradient Boosting. R package version 0.81.0.1. URL: https://CRAN.R-project.org/package=xgboost
Cirillo, E., Kutmon, M., Gonzalez Hernandez, M., Hooimeijer, T., Adriaens, M. E., Eijssen, L. M. T., Parnell, L. D., Coort, S. L. & Evelo, C. T. (2018), ‘From SNPs to pathways: Biological interpretation of type 2 diabetes (T2DM) genome wide association study (GWAS) results’, PLOS ONE 13(4), 1–19. URL: https://doi.org/10.1371/journal.pone.0193515
Cross-Disorder Group of the Psychiatric Genomics Consortium (2013), ‘Genetic relationship between five psychiatric disorders estimated from genome-wide SNPs’, Nature Genetics 45, 984. URL: https://doi.org/10.1038/ng.2711
Efron, B., Tibshirani, R., Storey, J. D. & Tusher, V. (2001), ‘Empirical Bayes analysis of a microarray experiment’, Journal of the American Statistical Association 96(456), 1151–1160. URL: https://doi.org/10.1198/016214501753382129
Friedman, J. H. (2001), ‘Greedy function approximation: A gradient boosting machine.’, The Annals of Statistics 29(5), 1189–1232. URL: https://doi.org/10.1214/aos/1013203451
Genovese, C. R., Roeder, K. & Wasserman, L. (2006), ‘False discovery control with p-value weighting’, Biometrika 93(3), 509–524. URL: https://doi.org/10.1093/biomet/93.3.509
GTEx Consortium (2015), ‘The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans’, Science 348(6235), 648–660. URL: https://science.sciencemag.org/content/348/6235/648
Ignatiadis, N., Klaus, B., Zaugg, J. B. & Huber, W. (2016), ‘Data-driven hypothesis weighting increases detection power in genome-scale multiple testing’, Nature Methods 13(7), 577–580. URL: https://www.ncbi.nlm.nih.gov/pmc/PMC4930141/
Korthauer, K., Kimes, P. K., Duvallet, C., Reyes, A., Subramanian, A., Teng, M., Shukla, C., Alm, E. J. & Hicks, S. C. (2019), ‘A practical guide to methods controlling false discoveries in computational biology’, Genome Biology 20(1), 118. URL: https://doi.org/10.1186/s13059-019-1716-1
Langfelder, P. & Horvath, S. (2008), ‘WGCNA: an R package for weighted correlation network analysis’, BMC Bioinformatics 9(1). URL: https://doi.org/10.1186/1471-2105-9-559
Langfelder, P. & Horvath, S. (2012), ‘Fast R functions for robust correlations and hierarchical clustering’, Journal of Statistical Software 46(11). URL: https://www.ncbi.nlm.nih.gov/pubmed/23050260
Langfelder, P., Zhang, B. & Horvath, S. (2008), ‘Defining clusters from a hierarchical cluster tree: The Dynamic Tree Cut package for R’, Bioinformatics (Oxford, England) 24.
Lei, L. & Fithian, W. (2018), ‘AdaPT: an interactive procedure for multiple testing with side information’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80(4), 649–679. URL: https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/rssb.12274
Lei, L., Ramdas, A. & Fithian, W. (2017), ‘STAR: A general interactive framework for FDR control under structural constraints’, ArXiv e-prints.
Li, A. & Barber, R. F. (2019), ‘Multiple testing with the structure-adaptive Benjamini-Hochberg algorithm’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 81(1), 45–74. URL: https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/rssb.12298
Lichtenstein, P., Yip, B. H., Björk, C., Pawitan, Y., Cannon, T. D., Sullivan, P. F. & Hultman, C. M. (2009), ‘Common genetic determinants of schizophrenia and bipolar disorder in Swedish families: a population-based study’, The Lancet 373(9659), 234–239. URL: https://doi.org/10.1016/S0140-6736(09)60072-6
Locke, A. E., Kahali, B., Berndt, S. I., Justice, A. E., Pers, T. H., Day, F. R., Powell, C., Vedantam, S., Buchkovich, M. L., Yang, J., Croteau-Chonka, D. C., Esko, T., Fall, T., Ferreira, T., Gustafsson, S., Kutalik, Z., Luan, J., Mägi, R., Randall, J. C., Winkler, T. W., Wood, A. R., Workalemahu, T., Faul, J. D., Smith, J. A., Hua Zhao, J., Zhao, W., Chen, J., Fehrmann, R., Hedman, Å. K., Karjalainen, J., Schmidt, E. M., Absher, D., Amin, N., Anderson, D., Beekman, M., Bolton, J. L., Bragg-Gresham, J. L., Buyske, S., Demirkan, A., Deng, G., Ehret, G. B., Feenstra, B., Feitosa, M. F., Fischer, K., Goel, A., Gong, J., Jackson, A. U., Kanoni, S., Kleber, M. E., Kristiansson, K., Lim, U., Lotay, V., Mangino, M., Mateo Leach, I., Medina-Gomez, C., Medland, S. E., Nalls, M. A., Palmer, C. D., Pasko, D., Pechlivanis, S., Peters, M. J., Prokopenko, I., Shungin, D., Stančáková, A., Strawbridge, R. J., Ju Sung, Y., Tanaka, T., Teumer, A., Trompet, S., van der Laan, S. W., van Setten, J., Van Vliet-Ostaptchouk, J. V., Wang, Z., Yengo, L., Zhang, W., Isaacs, A., Albrecht, E., Ärnlöv, J., Arscott, G. M., Attwood, A. P., Bandinelli, S., Barrett, A., Bas, I. N., Bellis, C., Bennett, A. J., Berne, C., Blagieva, R., Blüher, M., Böhringer, S., Bonnycastle, L. L., Böttcher, Y., Boyd, H. A., Bruinenberg, M., Caspersen, I. H., Ida Chen, Y.-D., Clarke, R., Warwick Daw, E., de Craen, A. J. M., Delgado, G., Dimitriou, M., Doney, A. S. F., Eklund, N., Estrada, K., Eury, E., Folkersen, L., Fraser, R. M., Garcia, M. E., Geller, F., Giedraitis, V., Gigante, B., Go, A. S., Golay, A., Goodall, A. H., Gordon, S. D., Gorski, M., Grabe, H.-J., Grallert, H., Grammer, T. B., Gräßler, J., Grönberg, H., Groves, C. J., Gusto, G., Haessler, J., Hall, P., Haller, T., Hallmans, G., Hartman, C. A., Hassinen, M., Hayward, C., Heard-Costa, N. L., Helmer, Q., Hengstenberg, C., Holmen, O., Hottenga, J.-J., James, A. L., Jeff, J. 
M., Johansson, Å., Jolley, J., Juliusdottir, T., Kinnunen, L., Koenig, W., Koskenvuo, M., Kratzer, W., Laitinen, J., Lamina, C., Leander, K., Lee, N. R., Lichtner, P., Lind, L., Lindström, J., Sin Lo, K., Lobbens, S., Lorbeer, R., Lu, Y., Mach, F., Magnusson, P. K. E., Mahajan, A., McArdle, W. L., McLachlan, S., Menni, C., Merger, S., Mihailov, E., Milani, L., Moayyeri, A., Monda, K. L., Morken, M. A., Mulas, A., Müller, G., Müller-Nurasyid, M., Musk, A. W., Nagaraja, R., Nöthen, M. M., Nolte, I. M., Pilz, S., Rayner, N. W., Renstrom, F., Rettig, R., Ried, J. S., Ripke, S., Robertson, N. R., Rose, L. M., Sanna, S., Scharnagl, H., Scholtens, S., Schumacher, F. R., Scott, W. R., Seufferlein, T., Shi, J., Vernon Smith, A., Smolonska, J., Stanton, A. V., Steinthorsdottir, V., Stirrups, K., Stringham, H. M., Sundström, J., Swertz, M. A., Swift, A. J., Syvänen, A.-C., Tan, S.-T., Tayo, B. O., Thorand, B., Thorleifsson, G., Tyrer, J. P., Uh, H.-W., Vandenput, L., Verhulst, F. C., Vermeulen, S. H., Verweij, N., Vonk, J. M., Waite, L. L., Warren, H. R., Waterworth, D., Weedon, M. N., Wilkens, L. R., Willenborg, C., Wilsgaard, T., Wojczynski, M. K., Wong, A., Wright, A. F., Zhang, Q., Study, T. L. C., Brennan, E. P., Choi, M., Dastani, Z., Drong, A. W., Eriksson, P., Franco-Cereceda, A., Gådin, J. R., Gharavi, A. G., Goddard, M. E., Handsaker, R. E., Huang, J., Karpe, F., Kathiresan, S., Keildson, S., Kiryluk, K., Kubo, M., Lee, J.-Y., Liang, L., Lifton, R. P., Ma, B., McCarroll, S. A., McKnight, A. J., Min, J. L., Moffatt, M. F., Montgomery, G. W., Murabito, J. M., Nicholson, G., Nyholt, D. R., Okada, Y., Perry, J. R. B., Dorajoo, R., Reinmaa, E., Salem, R. M., Sandholm, N., Scott, R. A., Stolk, L., Takahashi, A., Tanaka, T., van’t Hooft, F. M., Vinkhuyzen, A. A. E., Westra, H.-J., Zheng, W., Zondervan, K. T., Consortium, T. A., Group, T. A.-B. W., Consortium, T. C., Consortium, T. C., Glgc, T., Icbp, T., Investigators, T. M., Consortium, T. M., Consortium, T. 
M., Consortium, T. P., Consortium, T. R., Consortium, T. G., Consortium, T. I. E., Heath, A. C., Arveiler, D., Bakker, S. J. L., Beilby, J., Bergman, R. N., Blangero, J., Bovet, P., Campbell, H., Caulfield, M. J., Cesana, G., Chakravarti, A., Chasman, D. I., Chines, P. S., Collins, F. S., Crawford, D. C., Adrienne Cupples, L., Cusi, D., Danesh, J., de Faire, U., den Ruijter, H. M., Dominiczak, A. F., Erbel, R., Erdmann, J., Eriksson, J. G., Farrall, M., Felix, S. B., Ferrannini, E., Ferrières, J., Ford, I., Forouhi, N. G., Forrester, T., Franco, O. H., Gansevoort, R. T., Gejman, P. V., Gieger, C., Gottesman, O., Gudnason, V., Gyllensten, U., Hall, A. S., Harris, T. B., Hattersley, A. T., Hicks, A. A., Hindorff, L. A., Hingorani, A. D., Hofman, A., Homuth, G., Kees Hovingh, G., Humphries, S. E., Hunt, S. C., Hyppönen, E., Illig, T., Jacobs, K. B., Jarvelin, M.-R., Jöckel, K.-H., Johansen, B., Jousilahti, P., Wouter Jukema, J., Jula, A. M., Kaprio, J., Kastelein, J. J. P., Keinanen-Kiukaanniemi, S. M., Kiemeney, L. A., Knekt, P., Kooner, J. S., Kooperberg, C., Kovacs, P., Kraja, A. T., Kumari, M., Kuusisto, J., Lakka, T. A., Langenberg, C., Le Marchand, L., Lehtimäki, T., Lyssenko, V., Männistö, S., Marette, A., Matise, T. C., McKenzie, C. A., McKnight, B., Moll, F. L., Morris, A. D., Morris, A. P., Murray, J. C., Nelis, M., Ohlsson, C., Oldehinkel, A. J., Ong, K. K., Madden, P. A. F., Pasterkamp, G., Peden, J. F., Peters, A., Postma, D. S., Pramstaller, P. P., Price, J. F., Qi, L., Raitakari, O. T., Rankinen, T., Rao, D. C., Rice, T. K., Ridker, P. M., Rioux, J. D., Ritchie, M. D., Rudan, I., Salomaa, V., Samani, N. J., Saramies, J., Sarzynski, M. A., Schunkert, H., Schwarz, P. E. H., Sever, P., Shuldiner, A. R., Sinisalo, J., Stolk, R. P., Strauch, K., Tönjes, A., Trégoüet, D.-A., Tremblay, A., Tremoli, E., Virtamo, J., Vohl, M.-C., Völker, U., Waeber, G., Willemsen, G., Witteman, J. C., Carola Zillikens, M., Adair, L. S., Amouyel, P., Asselbergs, F. 
W., Assimes, T. L., Bochud, M., Boehm, B. O., Boerwinkle, E., Bornstein, S. R., Bottinger, E. P., Bouchard, C., Cauchi, S., Chambers, J. C., Chanock, S. J., Cooper, R. S., de Bakker, P. I. W., Dedoussis, G., Ferrucci, L., Franks, P. W., Froguel, P., Groop, L. C., Haiman, C. A., Hamsten, A., Hui, J., Hunter, D. J., Hveem, K., Kaplan, R. C., Kivimaki, M., Kuh, D., Laakso, M., Liu, Y., Martin, N. G., März, W., Melbye, M., Metspalu, A., Moebus, S., Munroe, P. B., Njølstad, I., Oostra, B. A., Palmer, C. N. A., Pedersen, N. L., Perola, M., Pérusse, L., Peters, U., Power, C., Quertermous, T., Rauramaa, R., Rivadeneira, F., Saaristo, T. E., Saleheen, D., Sattar, N., Schadt, E. E., Schlessinger, D., Eline Slagboom, P., Snieder, H., Spector, T. D., Thorsteinsdottir, U., Stumvoll, M., Tuomilehto, J., Uitterlinden, A., Uusitupa, M., van der Harst, P., Walker, M., Wallaschofski, H., Wareham, N. J., Watkins, H., Weir, D. R., Wichmann, H.-E., Wilson, J. F., Zanen, P., Borecki, I. B., Deloukas, P., Fox, C. S., Heid, I. M., O’Connell, J. R., Strachan, D. P., Stefansson, K., van Duijn, C. M., Abecasis, G. R., Franke, L., Frayling, T. M., McCarthy, M. I., Visscher, P. M., Scherag, A., Willer, C. J., Boehnke, M., Mohlke, K. L., Lindgren, C. M., Beckmann, J. S., Barroso, I., North, K. E., Ingelsson, E., Hirschhorn, J. N., Loos, R. J. F. & Speliotes, E. K. (2015), ‘Genetic studies of body mass index yield new insights for obesity biology’, Nature 518, 197 EP –. URL: http://dx.doi.org/10.1038/nature14177
Lohmueller, K. E., Pearce, C. L., Pike, M., Lander, E. S. & Hirschhorn, J. N. (2003), ‘Meta-analysis of genetic association studies supports a contribution of common variants to susceptibility to common disease’, Nature Genetics 33(2), 177–182. URL: https://doi.org/10.1038/ng1071
Mahajan, A., Taliun, D., Thurner, M., Robertson, N. R., Torres, J. M., Rayner, N. W., Payne, A. J., Steinthorsdottir, V., Scott, R. A., Grarup, N., Cook, J. P., Schmidt, E. M., Wuttke, M., Sarnowski, C., Mägi, R., Nano, J., Gieger, C., Trompet, S., Lecoeur, C., Preuss, M. H., Prins, B. P., Guo, X., Bielak, L. F., Below, J. E., Bowden, D. W., Chambers, J. C., Kim, Y. J., Ng, M. C. Y., Petty, L. E., Sim, X., Zhang, W., Bennett, A. J., Bork-Jensen, J., Brummett, C. M., Canouil, M., Ec kardt, K.-U., Fischer, K., Kardia, S. L. R., Kronenberg, F., Läll, K., Liu, C.-T., Locke, A. E., Luan, J., Ntalla, I., Nylander, V., Schönherr, S., Schurmann, C., Yengo, L., Bottinger, E. P., Brandslund, I., Christensen, C., Dedoussis, G., Florez, J. C., Ford, I., Franco, O. H., Frayling, T. M., Giedraitis, V., Hackinger, S., Hattersley, A. T., Herder, C., Ikram, M. A., Ingelsson, M., Jørgensen, M. E., Jørgensen, T., Kriebel, J., Kuusisto, J., Ligthart, S., Lindgren, C. M., Linneberg, A., Lyssenko, V., Mamakou, V., Meitinger, T., Mohlke, K. L., Morris, A. D., Nadkarni, G., Pankow, J. S., Peters, A., Sattar, N., Stančáková, A., Strauch, K., Taylor, K. D., Thorand, B., Thorleifsson, G., Thorsteinsdottir, U., Tuomilehto, J., Witte, D. R., Dupuis, J., Peyser, P. A., Zeggini, E., Loos, R. J. F., Froguel, P., Ingelsson, E., Lind, L., Groop, L., Laakso, M., Collins, F. S., Jukema, J. W., Palmer, C. N. A., Grallert, H., Metspalu, A., Dehghan, A., Köttgen, A., Abecasis, G. R., Meigs, J. B., Rotter, J. I., Marchini, J., Pedersen, O., Hansen, T., Langenberg, C., Wareham, N. J., Stefansson, K., Gloyn, A. L., Morris, A. P., Boehnke, M. & McCarthy, M. I. (2018), ‘Fine-mapping type 2 diabetes loci to single-variant resolution using high-density imputation and islet-specific epigenome maps’, Nature Genetics 50(11), 1505–1513. URL: https://doi.org/10.1038/s41588-018-0241-6
Nicolae, D. L., Gamazon, E., Zhang, W., Duan, S., Dolan, M. E. & Cox, N. J. (2010), ‘Trait-associated SNPs are more likely to be eQTLs: Annotation to enhance discovery from GWAS’, PLOS Genetics 6(4), 1–10. URL: https://doi.org/10.1371/journal.pgen.1000888
O’Connor, L. J., Schoech, A. P., Hormozdiari, F., Gazal, S., Patterson, N. & Price, A. L. (2019), ‘Extreme polygenicity of complex traits is explained by negative selection’, The American Journal of Human Genetics 105(3), 456–476. URL: http://www.sciencedirect.com/science/article/pii/S0002929719302666
R Core Team (2018), R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria. URL: https://www.R-project.org/
Ruderfer, D. M., Fanous, A. H., Ripke, S., McQuillin, A., Amdur, R. L., Schizophrenia Working Group of the Psychiatric Genomics Consortium, Bipolar Disorder Working Group of the Psychiatric Genomics Consortium, Cross-Disorder Working Group of the Psychiatric Genomics Consortium, Gejman, P. V., O’Donovan, M. C., Andreassen, O. A., Djurovic, S., Hultman, C. M., Kelsoe, J. R., Jamain, S., Landén, M., Leboyer, M., Nimgaonkar, V., Nurnberger, J., Smoller, J. W., Craddock, N., Corvin, A., Sullivan, P. F., Holmans, P., Sklar, P. & Kendler, K. S. (2014), ‘Polygenic dissection of diagnosis and clinical dimensions of bipolar disorder and schizophrenia’, Molecular Psychiatry 19(9), 1017–1024. URL: https://www.ncbi.nlm.nih.gov/pubmed/24280982
Ruderfer, D. M., Ripke, S., McQuillin, A., Boocock, J., Stahl, E. A., Pavlides, J. M. W., Mullins, N., Charney, A. W., Ori, A. P., Loohuis, L. M. O., Domenici, E., Florio, A. D., Papiol, S., Kalman, J. L., Trubetskoy, V., Adolfsson, R., Agartz, I., Agerbo, E., Akil, H., Albani, D., Albus, M., Alda, M., Alexander, M., Alliey-Rodriguez, N., Als, T. D., Amin, F., Anjorin, A., Arranz, M. J., Awasthi, S., Bacanu, S. A., Badner, J. A., Baekvad-Hansen, M., Bakker, S., Band, G., Barchas, J. D., Barroso, I., Bass, N., Bauer, M., Baune, B. T., Begemann, M., Bellenguez, C., Belliveau, R. A., Bellivier, F., Bender, S., Bene, J., Bergen, S. E., Berrettini, W. H., Bevilacqua, E., Biernacka, J. M., Bigdeli, T. B., Black, D. W., Blackburn, H., Blackwell, J. M., Blackwood, D. H., Pedersen, C. B., Boehnke, M., Boks, M., Borglum, A. D., Bramon, E., Breen, G., Brown, M. A., Bruggeman, R., Buccola, N. G., Buckner, R. L., Budde, M., Bulik-Sullivan, B., Bumpstead, S. J., Bunney, W., Burmeister, M., Buxbaum, J. D., Bybjerg-Grauholm, J., Byerley, W., Cahn, W., Cai, G., Cairns, M. J., Campion, D., Cantor, R. M., Carr, V. J., Carrera, N., Casas, J. P., Casas, M., Catts, S. V., Cervantes, P., Chambert, K. D., Chan, R. C., Chen, E. Y., Chen, R. Y., Cheng, W., Cheung, E. F., Chong, S. A., Clarke, T.-K., Cloninger, C. R., Cohen, D., Cohen, N., Coleman, J. R., Collier, D. A., Cormican, P., Coryell, W., Craddock, N., Craig, D. W., Crespo-Facorro, B., Crowley, J. J., Cruceanu, C., Curtis, D., Czerski, P. M., Dale, A. M., Daly, M. J., Dannlowski, U., Darvasi, A., Davidson, M., Davis, K. L., de Leeuw, C. A., Degenhardt, F., Favero, J. D., DeLisi, L. E., Deloukas, P., Demontis, D., DePaulo, J. R., di Forti, M., Dikeos, D., Dinan, T., Djurovic, S., Dobbyn, A. L., Donnelly, P., Donohoe, G., Drapeau, E., Dronov, S., Duan, J., Dudbridge, F., Duncanson, A., Edenberg, H., Edkins, S., Ehrenreich, H., Eichhammer, P., Elvsashagen, T., Eriksson, J., Escott-Price, V., Esko, T., Essioux, L., Etain, B., Fan, C. 
C., Farh, K.-H., Farrell, M. S., Flickinger, M., Foroud, T. M., Forty, L., Frank, J., Franke, L., Fraser, C., Freedman, R., Freeman, C., Freimer, N. B., Friedman, J. I., Fromer, M., Frye, M. A., Fullerton, J. M., Gade, K., Garnham, J., Gaspar, H. A., Gejman, P. V., Genovese, G., Georgieva, L., Giambartolomei, C., Giannoulatou, E., Giegling, I., Gill, M., Gillman, M., Pedersen, M. G., Giusti-Rodriguez, P., Godard, S., Goes, F., Goldstein, J. I., Gopal, S., Gordon, S. D., Gordon-Smith, K., Gratten, J., Gray, E., Green, E. K., Green, M. J., Greenwood, T. A., Grigoroiu-Serbanescu, M., Grove, J., Guan, W., Gurling, H., Parra, J. G., Gwilliam, R., de Haan, L., Hall, J., Hall, M.-H., Hammer, C., Hammond, N., Hamshere, M. L., Hansen, M., Hansen, T., Haroutunian, V., Hartmann, A. M., Hauser, J., Hautzinger, M., Heilbronner, U., Hellenthal, G., Henskens, F. A., Herms, S., Hipolito, M., Hirschhorn, J. N., Hoffmann, P., Hollegaard, M. V., Hougaard, D. M., Huang, H., Huckins, L., Hultman, C. M., Hunt, S. E., Ikeda, M., Iwata, N., Iyegbe, C., Jablensky, A. V., Jamain, S., Jankowski, J., Jayakumar, A., Joa, I., Jones, I., Jones, L. A., Jonsson, E. G., Julia, A., Jureus, A., Kahler, A. K., Kahn, R. S., Kalaydjieva, L., Kandaswamy, R., Karachanak-Yankova, S., Karjalainen, J., Karlsson, R., Kavanagh, D., Keller, M. C., Kelly, B. J., Kelsoe, J., Kennedy, J. L., Khrunin, A., Kim, Y., Kirov, G., Kittel-Schneider, S., Klovins, J., Knight, J., Knott, S. V., Knowles, J. A., Kogevinas, M., Konte, B., Kravariti, E., Kucinskas, V., Kucinskiene, Z. A., Kupka, R., Kuzelova-Ptackova, H., Landen, M., Langford, C., Laurent, C., Lawrence, J., Lawrie, S., Lawson, W. B., Leber, M., Leboyer, M., Lee, P. H., Keong, J. L. C., Legge, S. E., Lencz, T., Lerer, B., Levinson, D. F., Levy, S. E., Lewis, C. M., Li, J. Z., Li, M., Li, Q. S., Li, T., Liang, K.-Y., Liddle, J., Lieberman, J., Limborska, S., Lin, K., Linszen, D. H., Lissowska, J., Liu, C., Liu, J., Lonnqvist, J., Loughland, C. 
M., Lubinski, J., Lucae, S., Macek, M., MacIntyre, D. J., Magnusson, P. K., Maher, B. S., Mahon, P. B., Maier, W., Malhotra, A. K., Mallet, J., Malt, U. F., Markus, H. S., Marsal, S., Martin, N. G., Mata, I., Mathew, C. G., Mattheisen, M., Mattingsdal, M., Mayoral, F., McCann, O. T., McCarley, R. W., McCarroll, S. A., McCarthy, M. I., McDonald, C., McElroy, S. L., McGuffin, P., McInnis, M. G., McIntosh, A. M., McKay, J. D., McMahon, F. J., Medeiros, H., Medland, S. E., Meier, S., Meijer, C. J., Melegh, B., Melle, I., Meng, F., Mesholam-Gately, R. I., Metspalu, A., Michie, P. T., Milani, L., Milanova, V., Mitchell, P. B., Mokrab, Y., Montgomery, G. W., Moran, J. L., Morken, G., Morris, D. W., Mors, O., Mortensen, P. B., Mowry, B. J., Mhleisen, T. W., Mller-Myhsok, B., Murphy, K. C., Murray, R. M., Myers, R. M., Myin-Germeys, I., Neale, B. M., Nelis, M., Nenadic, I., Nertney, D. A., Nestadt, G., Nicodemus, K. K., Nievergelt, C. M., Nikitina-Zake, L., Nimgaonkar, V., Nisenbaum, L., Nordentoft, M., Nordin, A., Nthen, M. M., Nwulia, E. A., Ocallaghan, E., Odonovan, C., Odushlaine, C., Oneill, F. A., Oedegaard, K. J., Oh, S.-Y., Olincy, A., Olsen, L., Oruc, L., Os, J. V., Owen, M. J., Paciga, S. A., Palmer, C. N., Palotie, A., Pantelis, C., Papadimitriou, G. N., Parkhomenko, E., Pato, C., Pato, M. T., Paunio, T., Pearson, R., Perkins, D. O., Perlis, R. H., Perry, A., Pers, T. H., Petryshen, T. L., Pfennig, A., Picchioni, M., Pietilainen, O., Pimm, J., Pirinen, M., Plomin, R., Pocklington, A. J., Posthuma, D., Potash, J. B., Potter, S. C., Powell, J., Price, A., Pulver, A. E., Purcell, S. M., Quested, D., Ramos-Quiroga, J. A., Rasmussen, H. B., Rautanen, A., Ravindrarajah, R., Regeer, E. J., Reichenberg, A., Reif, A., Reimers, M. A., Ribases, M., Rice, J. P., Richards, A. L., Ricketts, M., Riley, B. P., Rivas, F., Rivera, M., Roffman, J. L., Rouleau, G. A., Roussos, P., Rujescu, D., Salomaa, V., Sanchez-Mora, C., Sanders, A. R., Sawcer, S. J., Schall, U., Schatzberg, A. 
F., Scheftner, W. A., Schofield, P. R., Schork, N. J., Schwab, S. G., Scolnick, E. M., Scott, L. J., Scott, R. J., Seidman, L. J., Serretti, A., Sham, P. C., Weickert, C. S., Shehktman, T., Shi, J., Shilling, P. D., Sigurdsson, E., Silverman, J. M., Sim, K., Slaney, C., Slominsky, P., Smeland, O. B., Smoller, J. W., So, H.-C., Sobell, J. L., Soderman, E., Hansen, C. S., Spencer, C. C., Spijker, A. T., Clair, D. S., Stefansson, H., Stefansson, K., Steinberg, S., Stogmann, E., Stordal, E., Strange, A., Straub, R. E., Strauss, J. S., Streit, F., Strengman, E., Strohmaier, J., Stroup, T. S., Su, Z., Subramaniam, M., Suvisaari, J., Svrakic, D. M., Szatkiewicz, J. P., Szelinger, S., Tashakkori-Ghanbaria, A., Thirumalai, S., Thompson, R. C., Thorgeirsson, T. E., Toncheva, D., Tooney, P. A., Tosato, S., Toulopoulou, T., Trembath, R. C., Treutlein, J., Trubetskoy, V., Turecki, G., Vaaler, A. E., Vedder, H., Vieta, E., Vincent, J., Visscher, P. M., Viswanathan, A. C., Vukcevic, D., Waddington, J., Waller, M., Walsh, D., Walshe, M., Walters, J. T., Wang, D., Wang, Q., Wang, W., Wang, Y., Watson, S. J., Webb, B. T., Weickert, T. W., Weinberger, D. R., Weisbrod, M., Weiser, M., Werge, T., Weston, P., Whittaker, P., Widaa, S., Wiersma, D., Wildenauer, D. B., Williams, N. M., Williams, S., Witt, S. H., Wolen, A. R., Wong, E. H., Wood, N. W., Wormley, B. K., Wu, J. Q., Xi, S., Xu, W., Young, A. H., Zai, C. C., Zandi, P., Zhang, P., Zheng, X., Zimprich, F., Zollner, S., Corvin, A., Fanous, A. H., Cichon, S., Rietschel, M., Gershon, E. S., Schulze, T. G., Cuellar-Barboza, A. B., Forstner, A. J., Holmans, P. A., Nurnberger, J. I., Andreassen, O. A., Lee, S. H., ODonovan, M. C., Sullivan, P. F., Ophoff, R. A., Wray, N. R., Sklar, P. & Kendler, K. S. (2018), ‘Genomic dissection of bipolar disorder and schizophrenia, including 28 subphenotypes’, Cell 173(7), 1705–1715.e16. URL: http://www.sciencedirect.com/science/article/pii/S0092867418306585
Scott, J. G., Kelly, R. C., Smith, M. A., Zhou, P. & Kass, R. E. (2015), ‘False discovery rate regression: An application to neural synchrony detection in primary visual cortex’, Journal of the American Statistical Association 110(510), 459–471. URL: https://doi.org/10.1080/01621459.2014.990973
Shungin, D., Winkler, T. W., Croteau-Chonka, D. C., Ferreira, T., Locke, A. E., Mägi, R., Strawbridge, R. J., Pers, T. H., Fischer, K., Justice, A. E., Workalemahu, T., Wu, J. M. W., Buchkovich, M. L., Heard-Costa, N. L., Roman, T. S., Drong, A. W., Song, C., Gustafsson, S., Day, F. R., Esko, T., Fall, T., Kutalik, Z., Luan, J., Randall, J. C., Scherag, A., Vedantam, S., Wood, A. R., Chen, J., Fehrmann, R., Karjalainen, J., Kahali, B., Liu, C.-T., Schmidt, E. M., Absher, D., Amin, N., Anderson, D., Beekman, M., Bragg-Gresham, J. L., Buyske, S., Demirkan, A., Ehret, G. B., Feitosa, M. F., Goel, A., Jackson, A. U., Johnson, T., Kleber, M. E., Kristiansson, K., Mangino, M., Mateo Leach, I., Medina-Gomez, C., Palmer, C. D., Pasko, D., Pechlivanis, S., Peters, M. J., Prokopenko, I., Stančáková, A., Ju Sung, Y., Tanaka, T., Teumer, A., Van Vliet-Ostaptchouk, J. V., Yengo, L., Zhang, W., Albrecht, E., Ärnlöv, J., Arscott, G. M., Bandinelli, S., Barrett, A., Bellis, C., Bennett, A. J., Berne, C., Blüher, M., Bohringer, S., Bonnet, F., Böttcher, Y., Bruinenberg, M., Carba, D. B., Caspersen, I. H., Clarke, R., Warwick Daw, E., Deelen, J., Deelman, E., Delgado, G., Doney, A. S. F., Eklund, N., Erdos, M. R., Estrada, K., Eury, E., Friedrich, N., Garcia, M. E., Giedraitis, V., Gigante, B., Go, A. S., Golay, A., Grallert, H., Grammer, T. B., Gräßler, J., Grewal, J., Groves, C. J., Haller, T., Hallmans, G., Hartman, C. A., Hassinen, M., Hayward, C., Heikkilä, K., Herzig, K.-H., Helmer, Q., Hillege, H. L., Holmen, O., Hunt, S. C., Isaacs, A., Ittermann, T., James, A. L., Johansson, I., Juliusdottir, T., Kalafati, I.-P., Kinnunen, L., Koenig, W., Kooner, I. K., Kratzer, W., Lamina, C., Leander, K., Lee, N. R., Lichtner, P., Lind, L., Lindström, J., Lobbens, S., Lorentzon, M., Mach, F., Magnusson, P. K. E., Mahajan, A., McArdle, W. L., Menni, C., Merger, S., Mihailov, E., Milani, L., Mills, R., Moayyeri, A., Monda, K. L., Mooijaart, S. P., Mühleisen, T. 
W., Mulas, A., Müller, G., Müller-Nurasyid, M., Nagaraja, R., Nalls, M. A., Narisu, N., Glorioso, N., Nolte, I. M., Olden, M., Rayner, N. W., Renstrom, F., Ried, J. S., Robertson, N. R., Rose, L. M., Sanna, S., Scharnagl, H., Scholtens, S., Sennblad, B., Seufferlein, T., Sitlani, C. M., Vernon Smith, A., Stirrups, K., Stringham, H. M., Sundström, J., Swertz, M. A., Swift, A. J., Syvänen, A.-C., Tayo, B. O., Thorand, B., Thorleifsson, G., Tomaschitz, A., Troffa, C., van Oort, F. V. A., Verweij, N., Vonk, J. M., Waite, L. L., Wennauer, R., Wilsgaard, T., Wojczynski, M. K., Wong, A., Zhang, Q., Hua Zhao, J., Brennan, E. P., Choi, M., Eriksson, P., Folkersen, L., Franco-Cereceda, A., Gharavi, A. G., Hedman, Å. K., Hivert, M.-F., Huang, J., Kanoni, S., Karpe, F., Keildson, S., Kiryluk, K., Liang, L., Lifton, R. P., Ma, B., McKnight, A. J., McPherson, R., Metspalu, A., Min, J. L., Moffatt, M. F., Montgomery, G. W., Murabito, J. M., Nicholson, G., Nyholt, D. R., Olsson, C., Perry, J. R. B., Reinmaa, E., Salem, R. M., Sandholm, N., Schadt, E. E., Scott, R. A., Stolk, L., Vallejo, E. E., Westra, H.-J., Zondervan, K. T., Consortium, T. A., Consortium, T. C., Consortium, T. C., Consortium, T. G., Consortium, T. G., Glgc, T., Icbp, T., Consortium, T. I. E., Study, T. L. C., Investigators, T. M., Consortium, T. M., Consortium, T. P., Consortium, T. R., Amouyel, P., Arveiler, D., Bakker, S. J. L., Beilby, J., Bergman, R. N., Blangero, J., Brown, M. J., Burnier, M., Campbell, H., Chakravarti, A., Chines, P. S., Claudi-Boehm, S., Collins, F. S., Crawford, D. C., Danesh, J., de Faire, U., de Geus, E. J. C., Dörr, M., Erbel, R., Eriksson, J. G., Farrall, M., Ferrannini, E., Ferrières, J., Forouhi, N. G., Forrester, T., Franco, O. H., Gansevoort, R. T., Gieger, C., Gudnason, V., Haiman, C. A., Harris, T. B., Hattersley, A. T., Heliövaara, M., Hicks, A. A., Hingorani, A. D., Hoffmann, W., Hofman, A., Homuth, G., Humphries, S. 
E., Hyppönen, E., Illig, T., Jarvelin, M.-R., Johansen, B., Jousilahti, P., Jula, A. M., Kaprio, J., Kee, F., Keinanen-Kiukaanniemi, S. M., Kooner, J. S., Kooperberg, C., Kovacs, P., Kraja, A. T., Kumari, M., Kuulasmaa, K., Kuusisto, J., Lakka, T. A., Langenberg, C., Le Marchand, L., Lehtimäki, T., Lyssenko, V., Männistö, S., Marette, A., Matise, T. C., McKenzie, C. A., McKnight, B., Musk, A. W., Möhlenkamp, S., Morris, A. D., Nelis, M., Ohlsson, C., Oldehinkel, A. J., Ong, K. K., Palmer, L. J., Penninx, B. W., Peters, A., Pramstaller, P. P., Raitakari, O. T., Rankinen, T., Rao, D. C., Rice, T. K., Ridker, P. M., Ritchie, M. D., Rudan, I., Salomaa, V., Samani, N. J., Saramies, J., Sarzynski, M. A., Schwarz, P. E. H., Shuldiner, A. R., Staessen, J. A., Steinthorsdottir, V., Stolk, R. P., Strauch, K., Tönjes, A., Tremblay, A., Tremoli, E., Vohl, M.-C., Völker, U., Vollenweider, P., Wilson, J. F., Witteman, J. C., Adair, L. S., Bochud, M., Boehm, B. O., Bornstein, S. R., Bouchard, C., Cauchi, S., Caulfield, M. J., Chambers, J. C., Chasman, D. I., Cooper, R. S., Dedoussis, G., Ferrucci, L., Froguel, P., Grabe, H.-J., Hamsten, A., Hui, J., Hveem, K., Jöckel, K.-H., Kivimaki, M., Kuh, D., Laakso, M., Liu, Y., Marz, W., Munroe, P. B., Njølstad, I., Oostra, B. A., Palmer, C. N. A., Pedersen, N. L., Perola, M., Pérusse, L., Peters, U., Power, C., Quertermous, T., Rauramaa, R., Rivadeneira, F., Saaristo, T. E., Saleheen, D., Sinisalo, J., Eline Slagboom, P., Snieder, H., Spector, T. D., Thorsteinsdottir, U., Stumvoll, M., Tuomilehto, J., Uitterlinden, A., Uusitupa, M., van der Harst, P., Veronesi, G., Walker, M., Wareham, N. J., Watkins, H., Wichmann, H.-E., Abecasis, G. R., Assimes, T. L., Berndt, S. I., Boehnke, M., Borecki, I. B., Deloukas, P., Franke, L., Frayling, T. M., Groop, L. C., Hunter, D. J., Kaplan, R. C., O’Connell, J. R., Qi, L., Schlessinger, D., Strachan, D. P., Stefansson, K., van Duijn, C. M., Willer, C. J., Visscher, P. M., Yang, J., Hirschhorn, J. 
N., Carola Zillikens, M., McCarthy, M. I., Speliotes, E. K., North, K. E., Fox, C. S., Barroso, I., Franks, P. W., Ingelsson, E., Heid, I. M., Loos, R. J. F., Cupples, L. A., Morris, A. P., Lindgren, C. M. & Mohlke, K. L. (2015), ‘New genetic loci link adipose and insulin biology to body fat distribution’, Nature 518. URL: https://doi.org/10.1038/nature14132
Storey, J. D. (2002), ‘A direct approach to false discovery rates’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64(3), 479–498. URL: https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/1467-9868.00346
The Gene Ontology Consortium (2018), ‘The gene ontology resource: 20 years and still going strong’, Nucleic Acids Research 47(D1), D330–D338. URL: https://doi.org/10.1093/nar/gky1055
Werling, D. M., Pochareddy, S., Choi, J., An, J.-Y., Sheppard, B., Peng, M., Li, Z., Dastmalchi, C., Santpere, G., Sousa, A. M. M., Tebbenkamp, A. T. N., Kaur, N., Gulden, F. O., Breen, M. S., Liang, L., Gilson, M. C., Zhao, X., Dong, S., Klei, L., Cicek, A. E., Buxbaum, J. D., Adle-Biassette, H., Thomas, J.-L., Aldinger, K. A., O’Day, D. R., Glass, I. A., Zaitlen, N. A., Talkowski, M. E., Roeder, K., State, M. W., Devlin, B., Sanders, S. J. & Sestan, N. (2019), ‘Whole-genome and rna sequencing reveal variation and transcriptomic coordination in the developing human prefrontal cortex’, bioRxiv. URL: https://www.biorxiv.org/content/early/2019/03/22/585430
Willer, C. J., Li, Y. & Abecasis, G. R. (2010), ‘Metal: fast and efficient meta-analysis of genomewide association scans’, Bioinformatics 26(17), 2190–2191. URL: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2922887/
Xiao, N., Wang, G. & Sun, L. (2018), grex: Gene ID Mapping for Genotype-Tissue Expression (GTEx) Data. R package version 1.8. URL: https://CRAN.R-project.org/package=grex
Yengo, L., Sidorenko, J., Kemper, K. E., Zheng, Z., Wood, A. R., Weedon, M. N., Frayling, T. M., Hirschhorn, J., Yang, J., Visscher, P. M. & the GIANT Consortium (2018), ‘Meta-analysis of genome-wide association studies for height and body mass index in 700000 individuals of european ancestry’, Human Molecular Genetics 27(20), 3641–3649. URL: https://doi.org/10.1093/hmg/ddy271
Zhang, B. & Horvath, S. (2005), ‘A general framework for weighted gene co-expression network analysis’, Statistical Applications in Genetics and Molecular Biology 4(1).
Zhang, M. J., Xia, F. & Zou, J. (2019), ‘Fast and covariate-adaptive method amplifies detection power in large-scale multiple hypothesis testing’, Nature Communications 10(1), 3433. URL: https://doi.org/10.1038/s41467-019-11247-0

Posted October 16, 2019.

Subject Area

Genetics

[1] ↵
Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M. & Sherlock, G. (2000), ‘Gene ontology: tool for the unification of biology’, Nature Genetics 25(1), 25–29. URL: https://doi.org/10.1038/75556
OpenUrl CrossRef PubMed Web of Science

[2] ↵
Benjamini, Y. & Hochberg, Y. (1995), ‘Controlling the false discovery rate: A practical and powerful approach to multiple testing’, Journal of the Royal Statistical Society. Series B (Methodological) 57(1), 289–300. URL: http://www.jstor.org/stable/2346101
OpenUrl CrossRef Web of Science

[3] ↵
Boca, S. M. & Leek, J. T. (2018), ‘A direct approach to estimating false discovery rates conditional on covariates’, PeerJ 6, e6035. URL: https://doi.org/10.7717/peerj.6035
OpenUrl

[4] ↵
Boyle, E. A., Li, Y. I. & Pritchard, J. K. (2017), ‘An expanded view of complex traits: From polygenic to omnigenic’, Cell 169(7), 1177–1186. URL: https://doi.org/10.1016/j.cell.2017.05.038
OpenUrl CrossRef PubMed

[5] ↵
Chen, T. & Guestrin, C. (2016), Xgboost: A scalable tree boosting system, in ‘Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining’, KDD ‘16, ACM, New York, NY, USA, pp. 785–794. URL: http://doi.acm.org/10.1145/2939672.2939785

[6] ↵
Chen, T., He, T., Benesty, M., Khotilovich, V., Tang, Y., Cho, H., Chen, K., Mitchell, R., Cano, I., Zhou, T., Li, M., Xie, J., Lin, M., Geng, Y. & Li, Y. (2019), xgboost: Extreme Gradient Boosting. R package version 0.81.0.1. URL: https://CRAN.R-project.org/package=xgboost

[7] ↵
Cirillo, E., Kutmon, M., Gonzalez Hernandez, M., Hooimeijer, T., Adriaens, M. E., Eijssen, L. M. T., Parnell, L. D., Coort, S. L. & Evelo, C. T. (2018), ‘From snps to pathways: Biological interpretation of type 2 diabetes (t2dm) genome wide association study (gwas) results’, PLOS ONE 13(4), 1–19. URL: https://doi.org/10.1371/journal.pone.0193515
OpenUrl CrossRef PubMed

[8] ↵
Cross-Disorder Group of the Psychiatric Genomics Consortium (2013), ‘Genetic relationship between five psychiatric disorders estimated from genome-wide snps’, Nature Genetics 45, 984 EP –. URL: https://doi.org/10.1038/ng.2711
OpenUrl CrossRef PubMed

[9] ↵
Efron, B., Tibshirani, R., Storey, J. D. & Tusher, V. (2001), ‘Empirical bayes analysis of a microarray experiment’, Journal of the American Statistical Association 96(456), 1151–1160. URL: https://doi.org/10.1198/016214501753382129
OpenUrl CrossRef Web of Science

[10] ↵
Friedman, J. H. (2001), ‘Greedy function approximation: A gradient boosting machine.’, The Annals of Statistics 29(5), 1189–1232. URL: https://doi.org/10.1214/aos/1013203451
OpenUrl CrossRef Web of Science

[11] ↵
Genovese, C. R., Roeder, K. & Wasserman, L. (2006), ‘False discovery control with p-value weighting’, Biometrika 93(3), 509–524. URL: https://doi.org/10.1093/biomet/93.3.509
OpenUrl CrossRef Web of Science

[12] ↵
GTEx Consortium (2015), ‘The genotype-tissue expression (gtex) pilot analysis: Multitissue gene regulation in humans’, Science 348(6235), 648–660. URL: https://science.sciencemag.org/content/348/6235/648
OpenUrl Abstract/FREE Full Text

[13] ↵
Ignatiadis, N., Klaus, B., Zaugg, J. B. & Huber, W. (2016), ‘Data-driven hypothesis weighting increases detection power in genome-scale multiple testing’, Nature methods 13(7), 577–580. URL: https://www.ncbi.nlm.nih.gov/pmc/PMC4930141/
OpenUrl

[14] ↵
Korthauer, K., Kimes, P. K., Duvallet, C., Reyes, A., Subramanian, A., Teng, M., Shukla, C., Alm, E. J. & Hicks, S. C. (2019), ‘A practical guide to methods controlling false discoveries in computational biology’, Genome Biology 20(1), 118. URL: https://doi.org/10.1186/s13059-019-1716-1
OpenUrl

[15] ↵
Langfelder, P. & Horvath, S. (2008), ‘Wgcna: an r package for weighted correlation network analysis’, BMC Bioinformatics 9(1) URL: https://doi.org/10.1186/1471-2105-9-559

[16] ↵
Langfelder, P. & Horvath, S. (2012), ‘Fast r functions for robust correlations and hierarchical clustering’, Journal of statistical software 46(11). URL: https://www.ncbi.nlm.nih.gov/pubmed/23050260

[17] ↵
Langfelder, P., Zhang, B. & Horvath, S. (2008), ‘Defining clusters from a hierarchical cluster tree: The dynamic tree cut package for r’, Bioinformatics (Oxford, England) 24.

[18] ↵
Lei, L. & Fithian, W. (2018), ‘Adapt: an interactive procedure for multiple testing with side information’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80(4), 649–679. URL: https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/rssb.12274
OpenUrl

[19] ↵
Lei, L., Ramdas, A. & Fithian, W. (2017), ‘STAR: A general interactive framework for FDR control under structural constraints’, ArXiv e-prints.

[20] ↵
Li, A. & Barber, R. F. (2019), ‘Multiple testing with the structure-adaptive benjaminihochberg algorithm’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 81(1), 45–74. URL: https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/rssb.12298
OpenUrl

[21] ↵
Lichtenstein, P., Yip, B. H., Björk, C., Pawitan, Y., Cannon, T. D., Sullivan, P. F. & Hultman, C. M. (2009), ‘Common genetic determinants of schizophrenia and bipolar disorder in swedish families: a population-based study’, The Lancet 373(9659), 234–239. URL: https://doi.org/10.1016/S0140-6736(09)60072-6
OpenUrl CrossRef

[23] ↵
Lohmueller, K. E., Pearce, C. L., Pike, M., Lander, E. S. & Hirschhorn, J. N. (2003), ‘Meta-analysis of genetic association studies supports a contribution of common variants to susceptibility to common disease’, Nature Genetics 33(2), 177–182. URL: https://doi.org/10.1038/ng1071
OpenUrl CrossRef PubMed Web of Science

[24] ↵
Mahajan, A., Taliun, D., Thurner, M., Robertson, N. R., Torres, J. M., Rayner, N. W., Payne, A. J., Steinthorsdottir, V., Scott, R. A., Grarup, N., Cook, J. P., Schmidt, E. M., Wuttke, M., Sarnowski, C., Mägi, R., Nano, J., Gieger, C., Trompet, S., Lecoeur, C., Preuss, M. H., Prins, B. P., Guo, X., Bielak, L. F., Below, J. E., Bowden, D. W., Chambers, J. C., Kim, Y. J., Ng, M. C. Y., Petty, L. E., Sim, X., Zhang, W., Bennett, A. J., Bork-Jensen, J., Brummett, C. M., Canouil, M., Ec kardt, K.-U., Fischer, K., Kardia, S. L. R., Kronenberg, F., Läll, K., Liu, C.-T., Locke, A. E., Luan, J., Ntalla, I., Nylander, V., Schönherr, S., Schurmann, C., Yengo, L., Bottinger, E. P., Brandslund, I., Christensen, C., Dedoussis, G., Florez, J. C., Ford, I., Franco, O. H., Frayling, T. M., Giedraitis, V., Hackinger, S., Hattersley, A. T., Herder, C., Ikram, M. A., Ingelsson, M., Jørgensen, M. E., Jørgensen, T., Kriebel, J., Kuusisto, J., Ligthart, S., Lindgren, C. M., Linneberg, A., Lyssenko, V., Mamakou, V., Meitinger, T., Mohlke, K. L., Morris, A. D., Nadkarni, G., Pankow, J. S., Peters, A., Sattar, N., Stančáková, A., Strauch, K., Taylor, K. D., Thorand, B., Thorleifsson, G., Thorsteinsdottir, U., Tuomilehto, J., Witte, D. R., Dupuis, J., Peyser, P. A., Zeggini, E., Loos, R. J. F., Froguel, P., Ingelsson, E., Lind, L., Groop, L., Laakso, M., Collins, F. S., Jukema, J. W., Palmer, C. N. A., Grallert, H., Metspalu, A., Dehghan, A., Köttgen, A., Abecasis, G. R., Meigs, J. B., Rotter, J. I., Marchini, J., Pedersen, O., Hansen, T., Langenberg, C., Wareham, N. J., Stefansson, K., Gloyn, A. L., Morris, A. P., Boehnke, M. & McCarthy, M. I. (2018), ‘Fine-mapping type 2 diabetes loci to single-variant resolution using high-density imputation and islet-specific epigenome maps’, Nature Genetics 50(11), 1505–1513. URL: https://doi.org/10.1038/s41588-018-0241-6
OpenUrl CrossRef PubMed

[25] ↵
Nicolae, D. L., Gamazon, E., Zhang, W., Duan, S., Dolan, M. E. & Cox, N. J. (2010), ‘Traitassociated snps are more likely to be eqtls: Annotation to enhance discovery from gwas’, PLOS Genetics 6(4), 1–10. URL: https://doi.org/10.1371/journal.pgen.1000888
OpenUrl

[26] ↵
O’Connor, L. J., Schoech, A. P., Hormozdiari, F., Gazal, S., Patterson, N. & Price, A. L. (2019), ‘Extreme polygenicity of complex traits is explained by negative selection’, The American Journal of Human Genetics 105(3), 456 –476. URL: http://www.sciencedirect.com/science/article/pii/S0002929719302666
OpenUrl

[27] ↵
R Core Team (2018), R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria. URL: https://www.R-project.org/

[28] ↵
Ruderfer, D. M., Fanous, A. H., Ripke, S., McQuillin, A., Amdur, R. L., of the Psychiatric Genomics Consortium, S. W. G., of the Psychiatric Genomics Consortium, B. D. W. G., of the Psychiatric Genomics Consortium, C.-D. W. G., Gejman, P. V., O’Donovan, M. C., Andreassen, O. A., Djurovic, S., Hultman, C. M., Kelsoe, J. R., Jamain, S., Landén, M., Leboyer, M., Nimgaonkar, V., Nurnberger, J., Smoller, J. W., Craddock, N., Corvin, A., Sullivan, P. F., Holmans, P., Sklar, P. & Kendler, K. S. (2014), ‘Polygenic dissection of diagnosis and clinical dimensions of bipolar disorder and schizophrenia’, Molecular psychiatry 19(9), 1017–1024. URL: https://www.ncbi.nlm.nih.gov/pubmed/24280982
OpenUrl CrossRef PubMed Web of Science

[30] ↵
Scott, J. G., Kelly, R. C., Smith, M. A., Zhou, P. & Kass, R. E. (2015), ‘False discovery rate regression: An application to neural synchrony detection in primary visual cortex’, Journal of the American Statistical Association 110(510), 459–471. URL: https://doi.org/10.1080/01621459.2014.990973
OpenUrl

[32] ↵
Storey, J. D. (2002), ‘A direct approach to false discovery rates’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64(3), 479–498. URL: https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/1467-9868.00346
OpenUrl CrossRef Web of Science

[33] ↵
The Gene Ontology Consortium (2018), ‘The gene ontology resource: 20 years and still going strong’, Nucleic Acids Research 47(D1), D330–D338. URL: https://doi.org/10.1093/nar/gky1055
OpenUrl CrossRef

[34] ↵
Werling, D. M., Pochareddy, S., Choi, J., An, J.-Y., Sheppard, B., Peng, M., Li, Z., Dastmalchi, C., Santpere, G., Sousa, A. M. M., Tebbenkamp, A. T. N., Kaur, N., Gulden, F. O., Breen, M. S., Liang, L., Gilson, M. C., Zhao, X., Dong, S., Klei, L., Cicek, A. E., Buxbaum, J. D., Adle-Biassette, H., Thomas, J.-L., Aldinger, K. A., O’Day, D. R., Glass, I. A., Zaitlen, N. A., Talkowski, M. E., Roeder, K., State, M. W., Devlin, B., Sanders, S. J. & Sestan, N. (2019), ‘Whole-genome and rna sequencing reveal variation and transcriptomic coordination in the developing human prefrontal cortex’, bioRxiv. URL: https://www.biorxiv.org/content/early/2019/03/22/585430

[35] ↵
Willer, C. J., Li, Y. & Abecasis, G. R. (2010), ‘Metal: fast and efficient meta-analysis of genomewide association scans’, Bioinformatics 26(17), 2190–2191. URL: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2922887/
OpenUrl CrossRef PubMed Web of Science

[36] ↵
Xiao, N., Wang, G. & Sun, L. (2018), grex: Gene ID Mapping for Genotype-Tissue Expression (GTEx) Data. R package version 1.8. URL: https://CRAN.R-project.org/package=grex

[37] ↵
Yengo, L., Sidorenko, J., Kemper, K. E., Zheng, Z., Wood, A. R., Weedon, M. N., Frayling, T. M., Hirschhorn, J., Yang, J., Visscher, P. M. & the GIANT Consortium (2018), ‘Meta-analysis of genome-wide association studies for height and body mass index in 700000 individuals of european ancestry’, Human Molecular Genetics 27(20), 3641–3649. URL: https://doi.org/10.1093/hmg/ddy271
OpenUrl CrossRef PubMed

[38] ↵
Zhang, B. & Horvath, S. (2005), ‘A general framework for weighted gene co-expression network analysis a general framework for weighted gene co-expression network analysis’, Statistical Applications in Genetics and Molecular Biology 4(1).

[39] ↵
Zhang, M. J., Xia, F. & Zou, J. (2019), ‘Fast and covariate-adaptive method amplifies detection power in large-scale multiple hypothesis testing’, Nature Communications 10(1), 3433. URL: https://doi.org/10.1038/s41467-019-11247-0
OpenUrl
