Bayesian model comparison for rare variant association studies of multiple phenotypes

Whole genome sequencing studies applied to large populations or biobanks with extensive phenotyping raise new analytic challenges. The need to consider many variants at a locus or group of genes simultaneously and the potential to study many correlated phenotypes with shared genetic architecture provide opportunities for discovery and inference that are not addressed by the traditional one variant-one phenotype association study. Here we introduce a model comparison approach we refer to as MRP for rare variant association studies that considers correlation, scale, and location of genetic effects across a group of genetic variants, phenotypes, and studies. We consider the use of summary statistic data to apply univariate and multivariate gene-based meta-analysis models for identifying rare variant associations with an emphasis on protective protein-truncating variants that can expedite drug discovery. Through simulation studies, we demonstrate that the proposed model comparison approach can improve ability to detect rare variant association signals. We also apply the model to two groups of phenotypes from the UK Biobank: 1) asthma diagnosis, eosinophil counts, forced expiratory volume, and forced vital capacity; and 2) glaucoma diagnosis, intra-ocular pressure, and corneal resistance factor. We are able to recover known associations such as the protective association between rs146597587 in IL33 and asthma. We also find evidence for novel protective associations between rare variants in ANGPTL7 and glaucoma. Overall, we show that the MRP model comparison approach is able to retain and improve upon useful features from widely-used meta-analysis approaches for rare variant association analyses and prioritize protective modifiers of disease risk. Author summary Due to the continually decreasing cost of acquiring genetic data, we are now beginning to see large collections of individuals for which we have both genetic information and trait data such as disease status, physical measurements, biomarker levels, and more. These datasets offer new opportunities to find relationships between inherited genetic variation and disease. While it is known that there are relationships between different traits, typical genetic analyses only focus on analyzing one genetic variant and one phenotype at a time. Additionally, it is difficult to identify rare genetic variants that are associated with disease due to their scarcity, even among large sample sizes. In this work, we present a method for identifying associations between genetic variation and disease that considers multiple rare variants and phenotypes at the same time. By sharing information across rare variant and phenotypes, we improve our ability to identify rare variants associated with disease compared to considering a single rare variant and a single phenotype. The method can be used to identify candidate disease genes as well as genes that might represent attractive drug targets.

Sequencing technologies are quickly transforming human genetic studies of complex 2 traits: it is increasingly possible to obtain whole genome sequence data on thousands of 3 samples at manageable costs. As a result, the genome-wide study of rare variants 4 (minor allele frequency [MAF] < 1%) and their contribution to disease susceptibility 5 and phenotype variation is now feasible [1][2][3][4]. 6 In genetic studies of diseases or continuous phenotypes, rare variants are hard to 7 assess individually due to the limited number of copies of each rare variant. Hence, to 8 boost the ability to detect a signal, evidence is usually 'aggregated' across variants. 9 When designing an 'aggregation' method, there are three questions that are usually 10 considered. First, across which biological units should variants be combined; second, 11 which variants mapping within those units should be included [5]; and third, which 12 statistical model should be used [6]? Given the widespread observations of shared 13 genetic risk factors across distinct diseases, there is also considerable motivation to use 14 gene discovery approaches that leverage the information from multiple phenotypes 15 jointly. In other words, rather than only aggregating variants that may have effects on a 16 single phenotype, we can also bring together sets of phenotypes for which a single 17 variant or sets of variants might have effects. 18 In this paper, we present a Bayesian multiple rare variants and phenotypes (MRP) 19 model comparison approach for identifying rare variant associations as an alternative to 20 current widely-used statistical tests. The MRP framework exploits correlation, scale, or 21 location (direction) of genetic effects in a broad range of rare variant association study 22 designs including: case-control; multiple diseases and shared controls; single continuous 23 phenotype; multiple continuous phenotypes; or a mixture of case-control and multiple 24 continuous phenotypes (Fig 1). MRP makes use of Bayesian model comparison, 25 whereby we compute a Bayes Factor (BF) defined as the ratio of the marginal 26 likelihoods of the observed data under two models: 1) a pre-specified null where all 27 genetic effects are zero; and 2) an alternative model where factors like correlation, scale, 28 or location of genetic effects are considered. The BF is an alternative to p-values from 29 traditional hypothesis testing. For MRP, the BF represents the statistical evidence for a 30 non-zero effect for a particular group of rare variants on the phenotype(s) of interest. 31 While many large genetic consortia collect both raw genotype and phenotype data, in 32 practice, sharing of individual genotype and phenotype data across groups is difficult to 33 achieve. To address this, MRP can take summary statistics, such as estimates of effect 34 multiple diseases with shared controls, iii) single quantitative phenotype, and iv) mixture of case-control and quantitative phenotypes. B: Diagram of factors considered in rare variant association analysis including the correlation matrices: R study (expected correlation of genetic effects among a group of studies), R var (expected correlation of genetic effects among a group of variants), and R phen (expected correlation of genetic effects among a group of phenotypes); the scale parameter for genetic variant annotation; and the location of genetic effects, which may be used to prioritize or identify protective modifiers of disease risk.

2/22
size and the corresponding standard error from typical single variant-single phenotype 35 linear or logistic regressions, as input data. Furthermore, we use insights from Liu et 36 al. [7] and Cichonska et al. [8] who suggest the use of additional summary statistics, like 37 covariance estimates across variants and studies, respectively, that would enable lossless 38 ability to detect gene-based association signals using summary statistics alone. 39 Aggregation techniques rely on variant annotations to assign variants to groups for 40 analysis. MRP allows for the inclusion of priors on the spread of effect sizes that can be 41 adjusted depending on what type of variants are included in the analysis. For instance, 42 protein truncating variants (PTVs) [9,10] are an important class of variants that are 43 more likely to be functional because they often disrupt the normal function of a gene. 44 This biological knowledge can be reflected in the choices of priors for PTVs in MRP. 45 Since PTVs typically abolish or severely alter gene function, there is particular interest 46 in identifying protective PTV modifiers of human disease risk that may serve as targets 47 for therapeutics [11][12][13] In this section, we provide an overview of the MRP model comparison approach. Refer 71 to S1 Appendix for a detailed description. MRP models GWAS summary statistics as 72 being distributed according to one of two models. The null model is that the regression 73 effect sizes obtained across all studies for a group of variants and a group of phenotypes 74 is zero. The alternative model is that summary statistics are distributed according to a 75 multivariate normal distribution with mean zero and covariance matrix described below. 76 MRP compares the evidence for the null and alternative model using a Bayes Factor 77 (BF) that quantifies the amount of evidence for each model as the ratio of the marginal 78 likelihoods of the observed data under two models.

79
To define the alternative model, we must specify the prior correlation structure, linkage-disequilibrium, effect sizes (or odds ratio), and standard error of the effect size. 86 When considering multiple studies (S > 1), multiple rare variants (M > 1), and 87 multiple phenotypes (K > 1), we define the prior correlation structure of the effect sizes 88 as an SM K × SM K matrix U. In practice, we define U as a Kronecker product, an 89 operation of matrices of arbitrary size, of three sub-matrices:

90
• an S × S matrix R study containing the correlations of genetic effects among studies 91 where different values can be used to compare different models of association, such 92 as for identifying heterogeneity of effect sizes between populations [17];

93
• an M × M matrix R var containing the correlations of genetic effects among 94 genetic variants, which may reflect the assumption that all the PTVs in a gene 95 may have the same biological consequence [9,10,18] or prior information obtained 96 through integration of additional data sources, such as functional assay 97 data [5,19], otherwise zero correlation of genetic effects may be assumed, which is 98 used in dispersion tests like C-alpha [20,21] and SKAT [14]; and

99
• a K × K matrix R phen containing the correlations of genetic effects among 100 phenotypes, which may be obtained from common variant data [22][23][24]. 101

4/22
The variance-covariance matrix of the effect sizes may be obtained from readily 102 available summary statistic data such as in-study LD matrices, effect size estimates (or 103 log odds ratios), and the standard errors of the effect size estimates (S1 Appendix within a gene, PTVs may have stronger effects than missense variants [25]. This can be 115 reflected by adjusting the prior spread of effect sizes (σ) for PTVs (S1 Appendix).

116
Similarly, we can utilize a prior on the location (direction) of effects to specify 117 alternative models where we seek to identify variants with protective effects against 118 disease. Thus far we have assumed that the prior mean, or location, of genetic effects is 119 zero which makes it feasible to analyze a large number of phenotypes without 120 enumerating the prior mean across all phenotypes. To proactively identify genetic 121 variants that have effects that are consistent with a protective profile for a disease, we 122 can include a non-zero vector as a prior mean of genetic effect (S1 Appendix). We can 123 exploit information from Mendelian randomization studies of common variants, such as 124 recent findings where rare truncating loss-of-function variants in PCSK9 were found to 125 decrease LDL and triglyceride levels and decrease CAD risk [11,[26][27][28]  Although summary statistics are quicker to read and process than raw data, the number 138 of studies meta-analyzed in this work is expected to be sufficiently large to require 139 optimizations in data representation and processing (S1 Fig). Our solution was the use 140 of the HDF5 (Hierarchical Data Format 5) data representation to enable rapid 141 processing of effect size, uncertainty, and cross-trait estimate data. HDF5 is a fast and 142 lightweight file format designed for scientific data. It has bindings for R, Python, 143 C/C++, Java, and nearly every other population programming language. Reading data 144 from a table within a HDF5 file can be an order of magnitude faster than reading text 145 files from a Unix file, and it makes it easier to organize data within an internal structure. 146 Simulation studies. A: Comparison of − log 10 (p-values) from frequentist BF MRP approximation for an independent effects and a similar effects model to commonly used gene-based statistical tests (skat and burden). B: Comparison of log10(Bayes Factors) obtained when raw genotype and phenotype data is available to a scenario where summary statistics only was available and similar effects across studies is assumed. C: From single variant and single phenotype to multiple variants and multiple phenotypes gene discovery: ROC curves for detecting gene association to any of the phenotypes using single variant/single phenotype association (green) to multiple variants and multiple phenotypes association (purple). D: ROC curves for detecting gene association when incorporating prior mean of genetic effects (orange) to identify protective alleles.

148
We performed genome-wide association analysis using PLINK v2.00a(17 July 2017) as 149 previously described [15]. For asthma, we used the Firth fallback in PLINK, a hybrid 150 algorithm which normally uses the logistic regression code described in [29], but 151 switches to a port of logistf()

153
(1) one of the cells in the 2x2 allele count by case/control status contingency variants that were specific to one array, we did not use array as a covariate. 160

6/22
Asthma and glaucoma cases were defined using both Hospital Episode Statistics and 161 verbal questionnaire responses. We used the provided values from the UK Biobank for 162 eosinophil counts, forced vital capacity (FVC), forced expiratory volume in 1-second 163 (FEV 1 ), intra-ocular pressure, and corneal resistance factor. The phenotype codes used 164 throughout (asthma=HC382, eosinophil count=INI30150, FEV 1 =INI3063, 165 FVC=INI3062, glaucoma=HC276, intra-ocular pressure=INI5263, and corneal For the Manhattan plots and tables, we removed any genes with non-unique gene 190 symbols. In cases where genes overlapped such that they shared rare variants and 191 therefore the same BF, we removed one gene. ANGPTL7 protein expression was 192 assessed using the HIPED protein expression database accessed through genecards.org 193 on 2017/1/29 [31]. We identified the protein 1JC9 A as homologous to the ANGPTL7 194 protein using the "3D structure mapping" link from dbSNP one. This removes variants with missingness greater than 1% (calculated on an 204 array-specific basis for array-specific variants) or Hardy-Weinberg equilibrium p < 10 −7 . 205 This also removes some PTVs for which manual inspection revealed irregular cluster 206 plots [15]. We LD pruned the variants by only using variants with ld equal to one. We

211
Simulation studies 212 We first verified the analytical derivations and examined the properties of the approach 213 under a simulation framework.

214
Comparison to frequentist gene tests 215 For the analysis of multiple rare variants and a single phenotype we compared it to the 216 burden test and the SKAT test, commonly used statistical tests in rare variant 217 association studies of a single phenotype. We observe concordance between the 218 frequentist methods and the Bayesian models. To compare the Bayesian models we 219 compute p-values by using the BF as the test statistic and approximating it using 220 distribution properties of quadratic forms (S1 Appendix). As expected, an independent 221 effects model has high correlation with the gene-based test SKAT (r 2 = 0.99), whereas 222 the similar effects model has high correlation with the burden test (r 2 = 0.93, Fig 2A). 223

Summary statistic data 224
To study the behavior of MRP using summary statistics we simulate two scenarios: first, 225 the scenario where analysts have access to all the raw genotype and phenotype data; 226 and second, the scenario where analysts only have access to summary statistics data [7]. 227 We conducted 1000 simulation experiments where we let K (the number of phenotypes) 228 To validate the flexibility of the approach we conducted a simulation experiment where 237 we assumed an allelic architecture consistent to that discovered for APOC3 in relation 238 to coronary artery disease (CAD), triglycerides (TG), low-density lipoprotein cholesterol 239 (LDL-C), and high-density lipoprotein cholesterol (HDL-C) [28,[32][33][34]. We simulated 240 three studies and applied the model comparison unit jointly to summary statistic data 241 obtained for each study (Supplementary Note). Overall, we observed that considering 242 the joint effects across multiple studies in a group of variants and phenotypes may 243 improve ability to detect gene-based signals (Fig 2C), and that considering prior mean 244 of genetic effects should aid in efforts to identify protective modifiers of disease risk 245 (Fig 2D). We applied the MRP model comparison approach to summary statistic data generated 248 from single variant logistic regression and linear regression analysis for coding variants 249 8/22 on the UK Biobank array (Methods). We applied MRP separately to asthma and three 250 related traits as well as glaucoma and two related traits.

255
Recent work has identified associations between the PTV rs146597587 in IL33 and 256 asthma and eosinophil counts [15,16]. FEV 1 and FVC are measures of pulmonary 257 function that are used to diagnosis and classify pulmonary disease [35]. To demonstrate 258 the advantage of considering the phenotypes jointly, we applied MRP to rare missense 259 variants and PTVs (MAF < 1%) for each phenotype separately (Fig 3A-D) as well as to 260 all phenotypes jointly (Fig 3E,F) and obtained log 10 BF for each gene. We applied both 261 independent and similar effects models and used Bayesian model averaging to compute 262 a single BF per gene [36]. In agreement with previous studies, we observed evidence 263 that rare missense variants and/or PTVs in IL33 affect eosinophil counts and offer 264 protection from asthma from the single-phenotype analyses, though the evidence of 265 association was strongest for the joint analysis (S1 Table) [15,16]. We performed an 266 analysis focused on identifying protective variants which also identified the IL33 267 association (Fig 3F). The results were similar using only either the independent effects 268 (S3 Fig) or similar effects models (S4 Fig). We inspected the effect sizes from the 269 marginal GWAS regressions for the rare variants included in the analysis and found that 270 the association identified by MRP is likely driven by the PTV rs146597587 (Fig 3G). 271 We also found moderate evidence for association between rare coding variants in 272 CCR3 and asthma. The log 10 BFs for CCR3 was 3.3 in the joint model compared to 273 only -0.5 in the asthma-only analysis (Fig 3, S1 Table). CCR3 is a chemokine receptor 274 that is highly expressed on eosinophils and has been a therapeutic focus for 275 asthma [37,38]. CCR3 was not reported in a large GWAS for allergic disease including 276 asthma [39] though CCR3 is near a locus associated with atopy in a previous 277 meta-analysis [40]. These results demonstrate that MRP can identify biologically 278 meaningful therapeutic targets that may be missed by standard GWAS approaches.

279
Considering multiple phenotypes jointly allows for the efficient prioritization of 280 disease genes. For instance, some genes like IL18RAP, ATP2A3, and FLG had log 10 281 BFs greater than 4 in the asthma-only analysis but much smaller BFs in the joint 282 analyses indicating that rare variants in these genes are less likely to affect this group of 283 traits. Similarly, there were other genes like RP11-39K24.9 and IL17RA that had larger 284 BFs in the eosinophil count-only analysis but small BFs for the joint analyses 285 demonstrating MRP's ability to integrate information across all phenotypes considered. 286 Glaucoma, intra-ocular pressure, and corneal resistance factor 287 We also applied MRP to missense variants and PTVs for glaucoma, intra-ocular 288 pressure, and corneal resistance factor as well as performing joint analyses. Intra-ocular 289 pressure is a measure of the fluid pressure in the eye, is associated with glaucoma risk, 290 and has been linked to genetic variants associated with glaucoma [41]. Corneal 291 resistance factor is a measure of the cornea's ability to resist mechanical stress and has 292 been associated with glaucoma presence and severity [42][43][44]. While the individual 293 glaucoma analysis did not yield any associations with log 10 BF greater than three, the 294 joint analysis identified rare coding variants in both ANGPTL7 and WNT10A as 295 associated with glaucoma (Fig 4A-D, S2 Table). Applying the MRP model identified 296 ANGPTL7 with effects consistent with protection to glaucoma, and additional  Forced vital capacity (FVC) log10 BF < 2 2 log10 BF < 3 log10 BF > 3  Expression of ANGPTL7 is upregulated in glaucoma and has been proposed to 300 regulate intra-ocular pressure and glaucoma risk [45,46]. The GWAS summary statistics 301 for the rare variants in ANGPTL7 suggest that the association with glaucoma is driven 302 by the missense variant rs28991009 that changes residue 175 from glutamine to histidine 303 (Fig 4F, G). According to the HIPED protein expression database, ANGPTL7 protein 304 is expressed at ∼ 0.7 parts per million in vitreous humor, the material between the lens 305 and retina of the eyeball; in contrast, the expression of ANGPTL7 protein is less than 306 0.01 parts per million in 68 other normal tissues [31]. Such tissue-specific activity may 307 10/22 make ANGPTL7 a useful therapeutic target. WNT10A has not been previously 308 associated with glaucoma though an exonic variant rs121908120 in WNT10A is 309 associated with central cornea thickness and increased risk of keratoconus, a disease of 310 the cornea, indicating that this gene may play a role in ocular diseases.

11/22
Discussion 312 In this study, we developed a Bayesian model comparison approach MRP that shares 313 information across both variants and phenotypes to identify rare variant associations. 314 We used simulations to compare MRP to the widely used burden and SKAT tests for 315 identifying rare variant associations and found that jointly considering both variants 316 and phenotypes can improve the ability to detect associations. We also applied the 317 MRP model comparison framework to summary statistic data from two groups of traits 318 from the UK Biobank: asthma diagnosis, eosinophil counts, FEV 1 , and FVC; and 319 glaucoma diagnosis, intra-ocular pressure, and corneal resistance factor. We identified 320 strong evidence for the previously described association between the PTV rs146597587 321 in IL33 and asthma [15,16]. We also found evidence for a link between rare variants in 322 ANGPTL7 and glaucoma, consistent with previous experiments that suggested a role 323 for ANGPTL7 in glaucoma [45,46]. These results demonstrate the ability of the MRP 324 model comparison approach to leverage information across multiple phenotypes and 325 variants to discover rare variant associations.

326
As genetic data linked to high-dimensional phenotype data continues to be made 327 available through biobanks, health systems, and research programs, there is a large need 328 for statistical approaches that can leverage information across different genetic variants, 329 phenotypes, and studies to make strong inferences about disease-associated genes.  Eosinophill count log10 BF < 2 2 log10 BF < 3 log10 BF > 3 Eosinophill count log10 BF < 2 2 log10 BF < 3 log10 BF > 3    Table. Highlighted genes from glaucoma analysis. log 10 Bayes Factors for 392 genes highlighted in Figure 4.