Incorporating prior information into signal-detection analyses across biologically informed gene-sets

Signal detection analyses are used to assess whether there is any evidence of signal within a large collection of hypotheses. For example, we may wish to assess whether there is any evidence of association with disease among a set of biologically related genes. Such an analysis typically treats all genes within a set similarly, even though there is substantial information concerning the likely importance of each gene within each set. For example, deleterious variants within genes that show evidence of purifying selection are more likely to substantially affect the phenotype than variants within genes that are not under purifying selection, at least for traits that are themselves subject to purifying selection. Here we improve such analyses by incorporating prior information into a higher-criticism-based signal detection analysis. We show that when this prior information is predictive of whether a gene is associated with disease, our approach can lead to a significant increase in power. We illustrate our approach with a gene-set analysis of amyotrophic lateral sclerosis (ALS), which implicates a number of gene-sets containing SOD1 and NEK1 and shows enrichment of small p-values for gene-sets containing known ALS genes.

[...] to identifying genetic variation that may be associated with disease. Unlike genome-wide association studies, which depend on linkage disequilibrium between tag SNPs and pathogenic variation, whole-exome sequencing (WES) and whole-genome sequencing (WGS) studies are able to assay pathogenic variation directly and, as a result, are able to directly interrogate the role of rare variation in disease. When the disease phenotype impacts the fitness of an individual, variants with a large effect on the phenotype will tend to be rare, as purifying selection will tend to prune them out of the population before they reach appreciable frequency. This has [...]

Further, when such an analysis is restricted to rare variation, a gene that demonstrates an excess of deleterious variants in cases over controls provides strong evidence for [...] the KS test by [...]

Incorporating prior information into higher-criticism statistics. To incorporate prior information into the HC framework, we assume that the $i$th gene ($i$th hypothesis being tested) has an associated weight, $w_i \geq 0$, such that $w_i$ quantifies the relative importance of the gene within the gene-set.
[...] where $F_n(p^*)$ is the empirical distribution function of the $p_i^*$s and $F_0(p^*)$ is the cumulative distribution function of $p_i^*$ under the global null hypothesis. We can show that [...]

It is not difficult to see that $HC^*$ is of the same form as the unweighted HC statistic studied by Jaeschke (1979), which was shown to converge in distribution to the Gumbel distribution as $n$ goes to infinity. However, as noted by Barrett and Lin (2014), this convergence is extremely slow and unlikely to yield a good approximation in most cases. As a result, we use permutation to approximate the null distribution of $HC^*$. When testing across a large number of gene-sets, we use the algorithm proposed by Ge, Dudoit, & [...]

[...] within a gene-set. We consider three main sources of this information: 1) genic intolerance; 2) network centrality; and 3) gene expression in disease-relevant tissues.

[...] Goldstein, 2013). Thus, if a gene has less functional variation than expected given the total amount of variation within the gene, it will have a negative RVIS score. If it has more functional variation than expected, it will tend to have a positive score. RVIS has been shown to be strongly predictive of Mendelian disease genes, especially those that lead to early-onset severe disease phenotypes (Petrovski, Wang, Heinzen, Allen, & Goldstein, 2013).

Here, we calculate a gene's intolerance-based weight, $w_i$, as the gene's intolerance percentile among all 18,536 scored genes, scaled to be between 0 and 2. By rescaling, we ensure that genes with intolerance scores below the mean, which are therefore more likely to be important in disease etiology, are given more importance in the overall gene-set by decreasing their p-values.

[...] gene set can be represented by a network. In such a representation, nodes denote genes and the edges connecting them represent gene-gene interactions. It is quite common in biologic networks for a few genes to have a much larger number of connections than other genes. These highly connected genes are referred to as "hub" genes, and it is reasonable to hypothesize that deleterious mutations within such genes might be more disruptive of the biologic process represented by the network than mutations falling within less connected, more distal genes.
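The intolerance-based weighting described earlier in this section (a percentile among all scored genes, scaled to lie between 0 and 2) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `intolerance_weights`, the mid-rank percentile convention (rank minus 0.5, divided by the number of genes), and the toy scores are all assumptions. How the resulting weights enter the HC statistic is not shown here.

```python
def intolerance_weights(scores):
    """Map intolerance scores to weights in (0, 2) via their percentile rank.

    Assumption (not from the text): percentile = (rank - 0.5) / n, with
    rank 1 given to the lowest (most intolerant) score and tied scores
    sharing their average rank. Under this convention the weights have
    mean exactly one.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        end = pos
        # Extend the run while the next sorted score is tied with this one.
        while end + 1 < n and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1  # average 1-based rank of the run
        for k in range(pos, end + 1):
            ranks[order[k]] = avg_rank
        pos = end + 1
    return [2.0 * (r - 0.5) / n for r in ranks]


# Hypothetical intolerance scores for four genes (illustration only).
toy_scores = [-1.2, 0.3, 2.0, -0.4]
toy_weights = intolerance_weights(toy_scores)
```

Under this convention, genes with lower (more intolerant) scores receive weights below one, which matches the stated aim of giving such genes more importance within the gene-set.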

The connectivity of a node is captured in the graph theory concept of "centrality" (White [...]

For a given centrality measure, let $c_i$ be the centrality for the $i$th gene. In order to generate weights that result in smaller p-values for more highly connected genes, we [...] where $\bar{c}$ is the mean centrality across the gene set, $a$ and $b$ are user-defined constants (here we take $a = 0.95$ and $b = 0.05$), and $k$ is a scaling factor so that the mean of the weights is one.

Gene expression in disease-related tissues. Genes that are important in disease etiology are more likely to be expressed in disease-related tissues during the developmental period leading to the disease. Therefore, for the $i$th gene we define a [...] in a disease-related tissue, $\bar{e}$ is the mean expression across all genes in the gene set, $a$ and $b$ are user-defined constants (here we take $a = 0.95$ and $b = 0.05$), and $k$ is a scaling factor so that the mean of the weights is one.
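To illustrate the centrality concept behind the network-based weights above, the sketch below computes PageRank centrality by power iteration on a small star-shaped network. The gene names, the damping factor of 0.85, and the function name `pagerank` are assumptions made for illustration. The mapping from centrality to weight is not reproduced, so the code stops at the centralities themselves.

```python
def pagerank(adjacency, damping=0.85, iterations=100):
    """PageRank by power iteration on an adjacency-list graph.

    `adjacency` maps each node to the list of nodes it links to. For an
    undirected gene network, list each edge in both directions.
    """
    nodes = list(adjacency)
    n = len(nodes)
    rank = {g: 1.0 / n for g in nodes}
    for _ in range(iterations):
        # Every node receives the same teleportation mass.
        new_rank = {g: (1.0 - damping) / n for g in nodes}
        for g in nodes:
            neighbours = adjacency[g]
            if neighbours:
                share = damping * rank[g] / len(neighbours)
                for h in neighbours:
                    new_rank[h] += share
            else:
                # A node without links spreads its mass evenly.
                for h in nodes:
                    new_rank[h] += damping * rank[g] / n
        rank = new_rank
    return rank


# Hypothetical network: one hub gene connected to three peripheral genes.
toy_network = {
    "HUB": ["G1", "G2", "G3"],
    "G1": ["HUB"],
    "G2": ["HUB"],
    "G3": ["HUB"],
}
centrality = pagerank(toy_network)
```

In this toy network the hub gene receives the largest centrality, which is the property the weighting scheme relies on to give hub genes smaller p-values.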

Simulation study
We conduct a simple simulation study to evaluate the utility of our approach. For each scenario, we simulate 10,000 datasets. For each simulated dataset we generate $n$ independent statistics $Z_i$, $i = 1, \ldots, n$, associated with $n$ hypotheses. Let $\epsilon$ be the proportion of the $Z_i$s that are generated under the alternative. We assume $Z_i \sim N(\mu, 1)$ under the alternative and $Z_i \sim N(0, 1)$ under the null. Thus, marginally, $Z_i \sim \epsilon N(\mu, 1) + (1 - \epsilon) N(0, 1)$. Note that $\epsilon$ characterizes the sparsity of the alternatives among all the hypotheses tested, while $\mu$ controls the location shift from null to alternative. Thus, in our simulations, we evaluate the power of our approach as $\epsilon$ and $\mu$ vary, and choose configurations that explore the detection boundary outlined by Donoho & Jin, 2014. Each $Z_i$ is converted to a p-value via $p_i = \Phi(-|Z_i|)$. Weights are generated from a truncated exponential distribution and then scaled to have mean one. We consider three different scenarios: 1) weights are randomly assigned to genes; 2) weights are negatively correlated with disease-associated genes so that their p-values in the $HC^*$ statistic are decreased, increasing their influence on the statistic; and 3) weights are positively correlated with disease-associated genes so that these genes will have less influence on the $HC^*$ statistic while the influence of genes that are not disease-associated will be increased. We generate a large number of simulated datasets under the global null (i.e., with every $Z_i \sim N(0, 1)$) and use these to calculate a rejection threshold for each scenario. Specifically, we take the top 5th percentile of the $HC$ and the $HC^*$ statistics [...]

[...] hypotheses, so that null hypotheses are given more influence on the HC statistic, that we see a substantial negative effect on power when using weighting (red dashed lines). However, in real applications one would expect that most weighting schemes would be somewhat informative of which genes would be disease-related.
Thus, these results suggest that there is little downside to weighting individual hypotheses in HC analyses.

ALS data analysis
We found that marginally associated genes had a strong effect on all HC analyses (weighted or not). For example, all gene-sets containing SOD1 (260) and NEK1 (21) are significantly associated with ALS after multiplicity adjustment, regardless of the HC statistic used (Table 2). GSEA (Subramanian et al., 2005) fails to detect any significant gene sets. To investigate whether there is residual signal in gene sets after the marginally significant genes are removed, we conducted gene set analyses that excluded SOD1 and NEK1 from inclusion in any gene set. This analysis did not detect any significant gene sets after multiplicity adjustment, regardless of the method used.

[...] gene sets involving 51 known ALS disease genes highlighted in Cirulli et al. 2015 (Table S1). The results of these analyses are presented in Table 3, and one can see that weighting based on PageRank centralities performs well. Since many of these gene sets are likely devoid of any signal, we repeated this analysis while further restricting the gene sets considered to those where at least one gene-set analysis approach yielded a marginally significant result (p ≤ 0.05) (Table 4). Once again, we find that PageRank centrality does well and that HC outperforms GSEA.
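The simulation design from the Simulation study section (normal mixture statistics, mean-one truncated exponential weights, and a rejection threshold taken from datasets simulated under the global null) can be sketched as below. This is a rough illustration under several assumptions that are not taken from the text: the HC statistic uses the standard Donoho-Jin form, p-values are two-sided ($2\Phi(-|z|)$) so that they are uniform under the null, weights act multiplicatively on p-values (capped at one), the exponential is truncated at a hypothetical value of 3, and the "informative" scenario is idealised by giving the smallest weights to the true signals. All function names are hypothetical.

```python
import math
import random


def two_sided_p(z):
    """Two-sided normal p-value 2 * Phi(-|z|) (an assumption, see lead-in)."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def hc_statistic(pvalues):
    """Donoho-Jin form: max_i sqrt(n) * (i/n - p_(i)) / sqrt(p_(i) * (1 - p_(i)))."""
    n = len(pvalues)
    best = -math.inf
    for i, p in enumerate(sorted(pvalues), start=1):
        if 0.0 < p < 1.0:  # skip degenerate values to avoid dividing by zero
            best = max(best, math.sqrt(n) * (i / n - p) / math.sqrt(p * (1.0 - p)))
    return best


def truncated_exp_weights(n, rng, upper=3.0):
    """Exponential(1) draws truncated at `upper`, rescaled to have mean one."""
    draws = []
    while len(draws) < n:
        x = rng.expovariate(1.0)
        if x <= upper:
            draws.append(x)
    mean = sum(draws) / n
    return [x / mean for x in draws]


def simulate_hc(n, eps, mu, rng, informative=False):
    """Return (HC, weighted HC) for one dataset from eps*N(mu,1) + (1-eps)*N(0,1)."""
    is_alt = [rng.random() < eps for _ in range(n)]
    pvals = [two_sided_p(rng.gauss(mu if alt else 0.0, 1.0)) for alt in is_alt]
    weights = truncated_exp_weights(n, rng)
    if informative:
        # Idealised scenario 2: the smallest weights go to the true signals.
        order = sorted(range(n), key=lambda i: not is_alt[i])
        assigned = [0.0] * n
        for w, i in zip(sorted(weights), order):
            assigned[i] = w
        weights = assigned
    weighted = [min(1.0, w * p) for w, p in zip(weights, pvals)]  # assumed form
    return hc_statistic(pvals), hc_statistic(weighted)


def power(n, eps, mu, n_sim=500, alpha=0.05, seed=1, informative=False):
    """Estimate power of (HC, weighted HC) at level alpha via a simulated null."""
    rng = random.Random(seed)
    null = [simulate_hc(n, 0.0, 0.0, rng, informative) for _ in range(n_sim)]
    alt = [simulate_hc(n, eps, mu, rng, informative) for _ in range(n_sim)]
    result = []
    for k in (0, 1):  # 0: unweighted HC, 1: weighted HC
        idx = max(0, round((1.0 - alpha) * n_sim) - 1)
        cut = sorted(s[k] for s in null)[idx]  # empirical null threshold
        result.append(sum(s[k] > cut for s in alt) / n_sim)
    return tuple(result)
```

A call such as `power(n=50, eps=0.2, mu=4.0, n_sim=200, informative=True)` returns the estimated power of the unweighted and weighted statistics. Far more replicates (the text uses 10,000 datasets per scenario) would be needed for stable estimates near the detection boundary.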

We have presented a new gene-set based analysis that incorporates prior information into the analysis using a higher criticism approach. In both simulation studies and real data analyses, we showed that such an approach can lead to higher power. However, the choice of weights is important, and consideration should be given to what information is most likely to be predictive of truly associated disease genes. For example, in our p-value enrichment analyses of known ALS genes, we found little enrichment when we used genic intolerance measures as our weights. As genic intolerance is indicative of purifying selection, this choice of weights may be less informative in a late-onset disorder such as ALS. Results would likely be different for earlier-onset disorders such as autism spectrum disorder, epilepsy, or schizophrenia. Further applications across a spectrum of diseases are needed before general recommendations can be made with respect to weighting schemes.

[...] extreme deviations (by taking a max) from expectation under a global null that none of the genes within the gene set are associated with the disease. Though this approach has been shown to be optimal in detecting sparse signals within a large collection of hypotheses, it may be less sensitive to detecting signal that is more diffuse. In such a case, there may be an advantage in integrating over the tail of the distribution of deviations rather than taking a max. We are currently investigating this approach and plan to highlight it in a future manuscript.