## Abstract

Genome wide association studies have identified numerous regions in the genome associated with hundreds of human diseases. Building accurate genetic risk prediction models from these data will have great impacts on disease prevention and treatment strategies. However, prediction accuracy remains moderate for most diseases, which is largely due to the challenges in identifying all the disease-associated variants and accurately estimating their effect sizes. We introduce AnnoPred, a principled framework that incorporates diverse functional annotation data to improve risk prediction accuracy, and demonstrate its performance on multiple human complex diseases.

## Main

Achieving accurate disease risk prediction using genetic information is a major goal in human genetics research and precision medicine. Accurate prediction models will have great impacts on disease prevention and early treatment strategies [1]. Advancements in high-throughput genotyping technologies and imputation techniques have greatly accelerated discoveries in genome-wide association studies (GWAS) [2]. Various approaches that utilize genome-wide data in genetic risk prediction have been proposed, including machine-learning models trained on individual-level genotype and phenotype data [3–⇓⇓⇓⇓8], and polygenic risk scores (PRS) estimated using GWAS summary statistics [9, 10]. Despite the potential information loss in summary data, PRS-based approaches have been widely adopted in practice since the summary statistics for large-scale association studies are often easily accessible [11, 12]. However, prediction accuracies for most complex diseases remain moderate, which is largely due to the challenges in both identifying all the functionally relevant variants and accurately estimating their effect sizes in the presence of linkage disequilibrium (LD) [13].

Explicit modeling and incorporation of external information, e.g. pleiotropy [7, 8] and LD [10], has been shown to effectively improve risk prediction accuracy. Recent advancements in integrative genomic functional annotation, coupled with the rich collection of summary statistics from GWAS, have enabled increase of statistical power in several different settings [14, 15]. To our knowledge, the impact of functional annotations on performance of genetic risk prediction has not been systematically studied.

Here, we introduce AnnoPred (available at https://github.com/yiminghu/AnnoPred), a principled framework that integrates GWAS summary statistics with various types of annotation data to improve risk prediction accuracy. We compare AnnoPred with state-of-the-art PRS-based approaches and demonstrate its consistent improvement in risk prediction performance using both simulations and real data of multiple human complex diseases.

AnnoPred risk prediction framework has three main stages (**Methods**). First, we estimate GWAS signal enrichment in 61 different annotation categories, including functional genome predicted by GenoCanyon scores [14], GenoSkyline tissue-specific functionality scores of 7 tissue types [15], and 53 baseline annotations [16]. Second, we propose an empirical prior of SNP effect size based on annotation assignment and signal enrichment. In general, SNPs located in annotation categories that are highly enriched for GWAS signals receive a higher effect size prior. Finally, the empirical prior is adopted in a Bayesian framework in which marginal summary statistics and LD matrix are jointly modeled to infer the posterior effect size of each SNP. AnnoPred PRS is defined by
where *X _{j}* and

*β*are the standardized genotype and effect size of the j

_{j}^{th}SNP, respectively, is the marginal estimate of

*β*, is the sample LD matrix, and

*E*denotes the posterior expectation under an empirical prior based on annotation assignment for all SNPs in the dataset (

_{A}**Methods**).

We first performed simulations to demonstrate AnnoPred’s ability to improve risk prediction accuracy. We compared AnnoPred with four popular PRS approaches (**Methods**), i.e. PRS based on genome-wide significant SNPs (PRS_{sig}), PRS based on all SNPs in the dataset (PRS_{all}), PRS based on tuned cutoffs for p-values and LD pruning (PRS_{P+T}), and recently proposed LDpred [10]. Mean correlations between simulated and predicted traits were calculated from 100 replicates under different simulation settings (Methods). AnnoPred showed the best prediction performance in all settings (**Table 1**). In general, performance of PRS_{sig}, PRS_{P+T}, LDpred, and AnnoPred all improved under a sparser genetic model and higher trait heritability. PRS_{all} showed comparable performance between sparse and polygenic models but its prediction accuracy was consistently worse than other methods. Sample size in the training set was also crucial for risk prediction accuracy. Doubling the training samples led to about 1.5-fold increase in AnnoPred’s performance under different settings in our simulations.

To further illustrate the improvement in risk prediction performance, we applied AnnoPred to five human complex diseases -- Crohn’s disease (CD), breast cancer (BC), rheumatoid arthritis (RA), type-II diabetes (T2D), and celiac disease (CEL). We first estimated GWAS signal enrichment in different annotation categories (**Methods**). Enrichment pattern varies greatly across diseases (**Figure 1A**; **Supplementary Table 1**), reflecting the genetic basis of these complex phenotypes. Functional genome predicted by GenoCanyon scores was consistently and significantly enriched for all five diseases. Blood was strongly enriched for three immune diseases, namely CD (P=8.9×10^{−12}), CEL (P=7.0×10^{−15}), and RA (P=9.9×10^{−6}), while gastrointestinal (GI) tract was enriched in CD (P=2.6×10^{−5}) and CEL (P=1.4×10^{−4}), both of which have a known GI component. For BC, epithelium (P=7.4×10^{−4}), GI (P=5.9×10^{−3}), and muscle (P=6.1×10^{−3}) were significantly enriched. Next, we evaluated the effectiveness of proposed empirical effect size prior in three diseases (i.e. CD, CEL, and RA) with well-powered testing cohorts (N>2,000). Interestingly, despite the highly variable enrichment results in training datasets, integrative effect size prior could effectively identify SNPs with large effect sizes and consistent effect directions in independent validation cohorts (**Figures 1B and 1C**).

Area under the receiver operating characteristic curve (AUC) for different approaches is summarized in **Table 2**. AnnoPred showed consistently improved prediction accuracy compared with all other methods across five diseases. Notably, PRS_{sig} and PRS_{all} showed suboptimal performance in these datasets, reaffirming the importance of modeling LD and other external information. To test different methods’ ability to stratify individuals with high risk, we compared the proportion of cases among testing samples with high PRS. AnnoPred outperformed all other methods in CD, CEL, RA, and T2D (**Supplementary Figure 1**). Next, we tested AnnoPred’s performance using only the 53 baseline annotations and observed a substantial drop in prediction accuracy for all diseases (**Supplementary Table 2**). These results highlight the importance of annotation quality in genetic risk prediction, and also demonstrate GenoCanyon and GenoSkyline’s ability to accurately identify functionality in the human genome.

Due to distinct allele frequencies and LD structures across populations, risk prediction accuracy usually drops when the training and testing samples are from different populations. In order to investigate the robustness of AnnoPred against population heterogeneity, we applied AnnoPred to three non-European cohorts for breast cancer and type-II diabetes while training the model using summary statistics from European-based studies. The AUCs are summarized in **Supplementary Table 3**. As expected, we observed a drop in prediction accuracy for all methods. However, AnnoPred still performed the best in all three trans-ethnic validation datasets.

Our work demonstrates that functional annotations can effectively improve performance of genetic risk prediction. AnnoPred jointly analyzes diverse types of annotation data and GWAS summary statistics to provide accurate estimates of SNP effect sizes, which lead to consistently better prediction accuracy for multiple complex diseases. Our method is not without limitation. First, despite the consistent improvement compared with existing PRS-based methods, AUCs for most diseases remain moderate. In order to effectively stratify risk groups for clinical usage, our model remains to be further calibrated using large cohorts with measured environmental and clinical risk factors [1]. Second, accurate estimation of GWAS signal enrichment and SNP effect sizes requires a large sample size for the training dataset. This could be potentially improved by better estimators for annotation-stratified heritability in the future [17]. The rich collection of publicly available integrative annotation data, in conjunction with the increasing accessibility of GWAS summary statistics, makes AnnoPred a customizable and powerful tool. As GWAS sample size continues to grow, AnnoPred has the potential to achieve even better prediction accuracy and become widely adopted as a summary of genetic contribution in clinical applications of risk prediction.

## Methods

### Annotation data

We incorporated GenoCanyon general functionality scores [14], GenoSkyline tissue-specific functionality scores for seven tissue types (brain, gastrointestinal tract, lung, heart, blood, muscle, and epithelium) [15], and 53 LDSC baseline annotations [16] into our model (**Supplementary Table 1**). We smoothened GenoCanyon annotation by taking the mean GenoCanyon score using a 10Kb window as previously suggested [18]. The smoothened GenoCanyon annotation and raw GenoSkyline annotations of seven tissue types were dichotomized based on a cutoff of 0.5. The regions with GenoCanyon or GenoSkyline scores greater than the cutoff were interpreted as non-tissue-specific or tissue-specific functional regions in the human genome. Such dichotomization has been previously shown to be robust against the cutoff choice [15]. Notably, the AnnoPred framework allows users to specify their own choice of annotations.

### Heritability partition

We assume throughout the paper that both the phenotype *Y _{N×l}* and the genotypes

*X*are standardized with mean zero and variance one. We assume a linear model.

_{N×M}*X*,

*β*and

*ɛ*are mutually independent. We also assume that

*β*is a random effect and effects of different SNPs are independent. A key idea in the AnnoPred framework is to utilize functional annotation information to accurately estimate SNPs’ effect sizes. In order to achieve that, we first partition trait heritability by annotations using LD score regression [16]. Per-SNP heritability is defined as the variance of

*β*for the i

_{i}^{th}SNP, and is used to quantify SNP effect sizes. More specifically, assume there are

*K*+ 1 predefined annotation categories, denoted as

*S*

_{0},

*S*

_{1},…,

*S*with

_{K}*S*

_{0}representing the entire genome. Under an additive assumption for heritability in overlapped annotations, we have , where τ

_{0}, τ

_{1},…, τ

_{k}quantify the contribution to per-SNP heritability from each annotation category. Denote the estimated marginal effect size of the i

^{th}SNP as , then we have the following approximation where

*l*(

*i,k*) is the annotation-stratified LD score and

*N*denotes the total sample size. Regression coefficients τ

_{k}are estimated through weighted least squares. The estimated heritability of the i

^{th}SNP is then

### Empirical prior of effect size

Based on per-SNP heritability estimates, we propose two different priors for SNP effect sizes to add flexibility against different genetic architecture. For the first prior, we assume SNP effect size follows a spike-and-slab distribution
where *p*_{0} is the proportion of causal SNPs in the dataset, and *δ*_{0} is a Dirac function representing a point mass at zero. The empirical variance of each SNP, i.e. , is determined by the annotation categories it falls in. More specifically, we assume, where *c* is a constant calculated from the following equation

We do not directly use as the empirical variance prior because it is estimated in the context where all SNPs in the 1000 genomes database are included in the model [16]. Such per-SNP heritability estimates cannot be extrapolated to the risk prediction context where much fewer SNPs are analyzed [19]. Therefore, we rescale the heritability estimates to better quantify each SNP’s contribution toward chip heritability. Following [20], we use a summary statistics-based heritability estimator that approximates Haseman-Elston estimator: where and denote mean and mean non-stratified LD score, respectively.

In the first prior, we assumed the same proportion of causal SNPs but different effect sizes across annotation categories. We now describe the second prior that assumes different proportions of but the same effect size for causal SNPs. To be specific, we assume causal effect size to be *Var*(*β _{causal}*) =

*V*, the total number of SNPs to be

*M*

_{0}, and the overall proportion of causal SNPs to be

*p*

_{0}. The total heritability could then be written as . For the i

^{th}SNP, use to denote the collection of SNPs that share the same annotation assignment with the i

^{th}SNP, and let

*M*= |

_{Ti}*T*|, i.e. number of SNPs in the set. Then, the total heritability of SNPs in

_{i}*T*is , with denoting the proportion of causal SNPs in

_{i}*T*. Following these notations, we have where and . We use to estimate , and use the following formula to estimate .

_{i}Finally, *p*_{0} is treated as a tuning parameter for both prior functions in our analysis.

### Calculation of posterior effect sizes

By Bayes’ rule, the posterior distribution of *β* is:
where is the sample correlation matrix and is the marginal effect size estimates. Given *β* and , follows a multivariate normal distribution asymptotically with the following mean and variance

However, is usually non-invertible and has very high dimensions. We thus study the posterior distribution of a small chunk of instead. Let be the estimated marginal effect size of SNPs in a region *b* (e.g. a LD block) and the corresponding genotype matrix is *X _{b}* and sample correlation matrix is . Then the conditional mean and variance of are
where is the heritability of SNPs in region

*b*, and

*X*and

_{−b}*β*denote the genotype matrix and effect sizes of SNPs not in region

_{−b}*b*. The conditional distribution of

*β*

_{b}

Although it is difficult to derive from the joint conditional distribution of β_{b}, each element of β_{b} follows a mixed normal distribution conditioning on , , and all other elements in β_{b}. Therefore, we could apply a Gibbs sampler to draw samples from , D and use the sample mean as an approximation for .

### Calculation of PRS

PRS is calculated using the following formula
where *E _{A}* denotes the posterior expectation as described above. In practice, the individual-level genotype matrix is not available and we use the LD matrix estimated from a reference panel or the validation samples to substitute . We apply the same standard of choosing the size of

*b*as described in [10]. Choices of prior and

*p*

_{0}can be tuned in an independent cohort. For the data analysis described in this work, we adopted a cross-validation scheme. We tuned parameters using half of the testing samples and evaluated prediction accuracy using the other half, and then repeated the analysis after reversing the two sample subsets. Finally, we reported the mean AUC of two crossvalidations.

### Other methods for comparison

We compared AnnoPred with four commonly used risk prediction methods based on summary data of association studies. PRS_{sig} and PRS_{all} were both calculated as the inner product of marginal effect size estimates and the corresponding genotypes. PRS_{all} used all the SNPs that are shared between training and testing datasets while PRS_{sig} only used SNPs with p-values below 5×10^{−8} in the training set. We downloaded python code for PRS_{P+T} and LDpred from Bitbucket (http://bitbucket.org/bjarni_vilhjalmsson/ldpred).All the tuning parameters were tuned through cross-validation as we did for AnnoPred.

### Simulation settings

We simulated traits from WTCCC genotype data, which contain 15,918 individuals genotyped for 393,273 SNPs after filtering variants with missing rate above 1% and individuals with genetic relatedness above 0.05. We first generated two annotations and each annotation was simulated by randomly selecting 10% of the genome, denoted as *A*_{1} and *A*_{2}. Denote the heritability of the trait as and the number of causal variants as *m* (300 or 3,000). Causal variants were generated as follows:*m*/3 causal variants were selected from *A*_{1}, *m*/3 from *A*_{2} and the rest from (*A*_{1}⋃*A*_{2})* ^{C}*. Effect sizes of causal variants were sampled from . For each simulation, we used 70% of the data to calculate the training summary statistics and randomly divided the rest 30% into two parts for parameter tuning. We also randomly selected half of the training data to calculate summary statistics in order to study the effect of sample size on prediction accuracy.

### GWAS summary statistics and validation data

We trained AnnoPred using publicly accessible GWAS summary statistics and evaluated risk prediction performance using individual-level genotype and phenotype data from cohorts independent from the training samples. Details for each training and testing dataset are provided in **Supplementary Notes** and **Supplementary Table 4**.

For Crohn’s disease, we trained the model using summary statistics from International Inflammatory Bowel Disease Genetics Consortium (IIBDGC; N_{case}=6,333 and N_{control}=15,056) [21]. Samples from the Wellcome Trust Case Control Consortium (WTCCC) were removed from the meta-analysis and used as the validation dataset (N_{case}=1,689 and N_{control}=2,891) [22]. For breast cancer, we trained the model using summary statistics from Genetic Associations and Mechanisms in Oncology (GAME-ON) study (N_{case}=16,003 and N_{control}=41,335) [23], and tested the performance using samples from the Cancer Genetic Markers of Susceptibility (CGEMS) study (N_{case}=966 and N_{control}=70) [24]. Shared samples between CGEMS and GAME-ON were removed. We used samples from the CIDR-GWAS of breast cancer for trans-ethnic analysis (N_{case}=1,666 and N_{control}=2,038) [25]. For rheumatoid arthritis, we used summary statistics from a meta-analysis with 5,539 cases and 20,169 controls to train the model [26]. WTCCC samples were removed from the meta-analysis and used for validation (N_{case}=1,829 and N_{control}=2,892) [22]. For type-II diabetes, the training dataset is Diabetes Genetics Replication and Meta-analysis (DIAGRAM) consortium GWAS with 12,171 cases and 56,862 controls [27]. We used samples from Northwestern NUgene Project for validation (N_{case}=662 and N_{control}=517) [28]. Samples from Institute for Personalized Medicine (IPM) eMERGE project are used for trans-ethnic analysis (African American:N_{case}=517 and N_{control}=213; Hispanic:N_{case}=477 and N_{control}=102) [29]. The training dataset for celiac disease is from a GWAS with 4,533 cases and 10,750 controls [30]. Samples in the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) celiac disease study were used for validation (N_{case}=1,716 and N_{control}=530)[31].

## Software availability

AnnoPred software and source code are freely available online at https://github.com/yiminghu/AnnoPred

## Acknowledgements

This study was supported in part by the National Institutes of Health grants R01 GM59507, the VA Cooperative Studies Program of the Department of Veterans Affairs, Office of Research and Development, and the Yale World Scholars Program sponsored by the China Scholarship Council. We also sincerely thank DIAGRAM, GAME-ON, IIBDGC, and ImmunoBase for making their GWAS summary data publicly accessible. This study makes use of data generated by the Wellcome Trust Case-Control Consortium. A full list of the investigators who contributed to the generation of the data is available from www.wtccc.org.uk. Funding for the project was provided by the Wellcome Trust under award 076113, 085475 and 090355. We also thank Dr. Bjarni J. Vilhjalmsson for sharing his codes.

### Author Contributions

Y.H., Q.L., H.Z. conceived the project and developed the model. Y.H., R.L.P, X.Y. developed the software. Y.H., Q.L., F.F., X.X. performed the analyses. C.Y. contributed collecting and curating data. H.Z. advised on statistical and genetic issues. Y.H., Q.L., and H.Z. wrote the manuscript, and all authors contributed to editing of the manuscript.

### Competing Financial Interests

The authors declare no competing financial interests.