Cancer PRSweb – an Online Repository with Polygenic Risk Scores (PRS) for Major Cancer Traits and Their Phenome-wide Exploration in Two Independent Biobanks

Lars G. Fritsche; Snehal Patil; Lauren J. Beesley; Peter VandeHaar; Maxwell Salvatore; Robert B. Peng; Daniel Taliun; Xiang Zhou; Bhramar Mukherjee

doi:10.1101/2020.01.22.915751

Abstract

To facilitate scientific collaboration on polygenic risk scores (PRS) research, we created an extensive PRS online repository for 49 common cancer traits integrating freely available genome-wide association studies (GWAS) summary statistics from three sources: published GWAS, the NHGRI-EBI GWAS Catalog, and UK Biobank-based GWAS. Our framework condenses these summary statistics into PRS using various approaches such as linkage disequilibrium pruning / p-value thresholding (fixed or data-adaptively optimized thresholds) and penalized, genome-wide effect size weighting. We evaluated the PRS in two biobanks: the Michigan Genomics Initiative (MGI), a longitudinal biorepository effort at Michigan Medicine, and the population-based UK Biobank (UKB). For each PRS construct, we provide measures on predictive performance, calibration, and discrimination. Besides PRS evaluation, the Cancer-PRSweb platform features construct downloads and phenome-wide PRS association study results (PRS-PheWAS) for predictive PRS. We expect this integrated platform to accelerate PRS-related cancer research.

Introduction

Since 2005, genome-wide association studies (GWAS) have successfully uncovered many common genetic variants associated with a plethora of complex traits and disorders [1–3]. Translation of these findings into clinical practice to improve pre-symptomatic screening and patient care is a major aspiration in the research community. However, genetic risk factors for complex diseases like cancer usually have relatively small risk effects and/or low frequencies and thus only have limited ability as individual predictors of risk in the overall population. Alternatively, the integration of all common risk variants into a single biomarker, called a polygenic risk score (PRS), represents a widely used approach for potentially identifying high-risk individuals at the highest levels of PRS [4–6]. For example, it was shown that PRS for five common complex diseases (coronary artery disease, atrial fibrillation, type 2 diabetes, inflammatory bowel disease, and breast cancer) have the potential to detect individuals at significantly higher genetic risk [4] that might benefit from intensified screening efforts, prophylactic prevention or earlier treatment. Several challenges have to be overcome for constructing a PRS that incorporates state of the art scientific knowledge: one needs (1) summary statistics from an independent discovery GWAS with phenotype and ancestry matching the target study [7]; (2) individual-level genetic data of a sufficiently large cohort to correct for linkage disequilibrium (LD) between genetic variants; and (3) a computationally efficient method to calculate each PRS and to find the best PRS construct for the target cohort.

The gold standards for GWAS to define PRS constructs are independent, large GWAS analyses or GWAS meta-analyses. Full summary statistics enable exploration of the complete spectrum of PRS construction methods, e.g., those that determine the optimal inclusion p-value threshold of risk variants for prediction, which often deviates from the standard threshold for genome-wide significance (P-value ≤ 5×10^-8). So far, several cancer GWAS research groups and consortia have openly shared their full GWAS summary statistics with the research community: ovarian carcinoma [8, 9], breast cancer [10, 11], prostate cancer [12], colorectal cancer [13], and cervical carcinoma [14]. Other groups have released variants that reached an arbitrarily chosen p-value threshold below genome-wide significance (e.g., P-value < 10^-5)[15]. In addition to complete or partial GWAS summary statistics, lists of genome-wide significant hits are available for nearly all published GWAS results. The NHGRI-EBI GWAS Catalog [2] (https://www.ebi.ac.uk/gwas/) curates and stores published risk variants for a plethora of traits in a structured database, offering a convenient and efficient way to extract GWAS hits for automated processing.

Alternative and growing sources for publicly available GWAS summary statistics across a large ensemble of diseases use UK Biobank genotype and phenotype data [16], adjusting for population stratification and/or relatedness between individuals ([17]; http://www.nealelab.is/uk-biobank and https://www.leelabsg.org/resources). These biobank-based approaches accessed thousands of phenotypes and traits that were defined in efficient automated fashion, e.g., by ICD10 diagnosis category, with specific phenotype defining algorithms like PHESANT [18] or PHEWAS CODES [19], or even with consortium-based curated phenotype constructs using the content of the electronic health records (EHR) (FINNGEN; https://www.finngen.fi/en/researchers/clinical-endpoints).

Another important aspect of finding a suitable set of GWAS summary statistics for a PRS is the mapping of the discovery GWAS trait, here the cancer phenotype, with the trait of interest in the target study. GWAS efforts usually balance specificity and sample size to maximize power for discovery. Consequently, the analyzed phenotype definition might not necessarily represent an ideal match to the phenotype of the target study. Also, differences in diagnosis coding practices in EHR systems, e.g., the preference for certain diagnoses due to billing purposes, might limit the transferability of phenotype definitions across cohorts, even if the same coding systems were used [20].

The simplest form of PRS construction requires two things: a selected set of independent risk variants with estimated or weighted risk effect sizes (say ), and genotype data of individuals genotyped at the selected sites (say G_i where i ∈ a list L). A PRS can then be calculated for each individual as the sum of the weighted risk increasing alleles, namely .

PRS construction methods and their underlying variant selection procedures can roughly be categorized into four groups: (i) fixed P-value thresholds of independent risk variants, e.g., “GWAS hits,” variants that reached genome-wide significance (with p < 5 × 10^-8; (ii) LD pruning (actually clumping) / P-value thresholding (“P&T”) of summary statistics that increases power by determining the most predictive P-value cut-off that can be above or below genome-significance [6]; (iii) genome-wide PRS that consider the full GWAS summary statistics after modeling LD, applying shrinkage or Bayesian approaches, e.g., LDpred and lassosum, [21–24] and (iv) methods that use individual-level data from a GWAS to determine an optimal set of independent predictors through Bayesian spike and slab or mixture priors [25]. The first two approaches typically use the originally reported effect sizes for weighting, while the latter two approaches model LD and/or shrink effect sizes. All methods require a reference panel for LD estimation that ideally resembles or matches the genotype data underlying the discovery GWAS source. Since most only have summary statistics and not individual-level data of the discovery study, we will use only the first three approaches for PRS construction, i.e., fixed P-value thresholds, LD pruning / P-value thresholding, and Lassosum.

PRS have increasingly been used in cancer risk prediction and stratification. A brief survey of PRS related literature in PubMed shows that ∼15% of all PRS articles are related to cancer, with 67% of cancer PRS papers focusing on common cancers (defined by the US National Cancer Institute (NCI; https://www.cancer.gov/types/common-cancers; estimated incidence of 40,000 or more in the United States in 2019). As of November 9, 173 PubMed articles on PRS and cancer have been published in 2019, more than double the previous high of 86 set in 2018, indicating the rapid growth in collection, curation, and generation of genetic data. These studies typically employ construction methods (i) and (ii) as described above, although joint variant models are becoming more common because they generally outperform methods (i) and (ii) and advanced software has made joint modeling more computationally efficient for large sample sizes [26, 27]. Several publications constructed PRS for cancer traits using different methods [28–30] and described their PRS methodology. However, very few share the variants selected and their corresponding weights, making it a challenge to compare or replicate PRS results in different cohorts. The Polygenic Score Catalog (http://www.pgscatalog.org) is a resource under active development to help researchers share, apply, and evaluate PRS. However, this resource primarily relies on external PRS sources and currently considers only 21 traits (including only four cancer traits: ovarian, colorectal, prostate and breast cancer), and no validation is carried out in large biobanks.

The primary goal of this study is the generation of PRS constructs for common groupings of cancer by using published, freely available cancer GWAS summary statistics and established PRS methods and genetic data from two large biobanks: the Michigan Genomics Initiative (MGI) and the UK Biobank (UKB) (Table 1). We explore hundreds of PRS constructs and offer optimized predictive PRS (in terms of maximal increase in an R²-type metric) for 49 cancers. The resulting repository of cancer PRS is made available online via an interactive platform, called Cancer PRSweb (http://prsweb.sph.umich.edu). In this platform, we accompany each GWAS source / PRS method combination with its downloadable constructs and performance metrics (like area under the receiver operating curve, tail enrichment, and Brier score), and we offer insights into secondary trait associations through screening of hundreds of cancer and non-cancer phenotypes of the EHR-derived phenomes of MGI and UKB (Table 1). We also make the summary statistics for the phenome-wide association study (PheWAS) available. Thus, this centralized and unified platform is a timely attempt to accelerate cancer research related to PRS.

View this table:

Table 1.

Demographics and clinical characteristics of the analytic datasets. The provided characteristics are based on the European (MGI) and White British (UKB) subjects for which phenotype and imputed genotype data were available.

Our repository contributes to the new and necessary work of democratizing PRS constructions and applications for several cancers under a uniform analytic framework to eventually develop transferable risk scores with clinical utility. We also offer phenome-wide exploration of PRS association through PRS-PheWAS, a tool previously introduced by this group [31, 32].

Results

PRS Construction

We screened the GWAS Catalog, PubMed, and UK Biobank GWAS efforts for any cancer GWAS summary statistics that were reported for European ancestry, to match the predominantly European cohorts of MGI and UKB, and that were openly available, i.e., did not require contacting the main authors or any form of written approval process. We identified 232 source sets that reported complete information for each tested single nucleotide polymorphisms (SNP) (position [and/or dbSNP ID], effect allele, effect estimate, p-value, and, ideally, effect allele frequency). We obtained 188 SNP sets based on UKB GWAS, 29 based on excerpts from the GWAS Catalog, and 20 from large GWAS or GWAS meta-analyses (Table S1 & S2).

We manually matched the traits of the identified cancer GWAS to cancer traits of MGI and UKB PheWAS-codes and analyzed each GWAS source separately, generating PRS for each. The discovery GWAS traits of the 232 source sets approximated 68 cancer PheWAS-codes of the MGI phenome and 21 PheWAS-codes in the UKB phenome (Table S1 & S2). Following the scheme in Figure 1 and Table 2, we generated PRS using the “P & T” and/or “lassosum” approach and also generated PRS using fixed P-value thresholds after LD clumping (P-value ≤ 5×10^-5, 5×10^-6, 5×10^-7, 5×10^-8 [“GWAS Hits”], or 5×10^-9). Using these methods and the available GWAS sources, we generated a total of 1,292 PRS (1,080 PRS for the MGI cohort and 212 PRS for the UKB cohort) (Table S3).

Figure 1.

Schematic overview of PRS generation and analysis.

View this table:

Table 2.

Overview of GWAS sources and PRS construction methods. Multiple PRS were constructed per trait of interest depending on availability of GWAS summary statistics.

PRS Evaluation

We tested the association between each PRS and its corresponding cancer trait and evaluated each PRS in terms of performance (pseudo-R²), accuracy (Brier score), discrimination (area under the receiver operating characteristic curve [AUC]), and calibration (Hosmer-Lemeshow test). Finally, we tested their utility for risk stratification, i.e., their ability to enrich cases in five selected top percentiles (1, 2, 5, 10, and 25%) versus the rest of the PRS distribution (Figure 1).

As an initial filtering step, we removed 625 PRS (48% of total PRS considered) that were not significantly (603 PRS with P > 0.05) or negatively (241 PRS) associated with their corresponding cancer trait in MGI and/or UKB. The majority of these filtered PRS were either based on discovery GWAS with small sample sizes that often did not identify any genome-wide significant hits or were evaluated for diseases with few cases or both, indicating a potential lack of power in our analysis. A total of 667 PRS for 49 different cancer traits were positively and significantly associated with their corresponding cancers in MGI (478 PRS; 49 cancer traits) and UKB (189 PRS; 19 cancer traits) (Table S3).

Comparison of Performance Metrics

In general, we found that the ranking by pseudo-R² ensured strong performance across other metrics related to discrimination, accuracy, and overall association of PRS constructs for their specific cancers. Conversely, the enrichment analyses in the extreme PRS percentiles (e.g., top 5% versus rest) was not always concordant with the selection of optimal PRS based on pseudo-R², showing that performance in the extreme tails could be optimized by a modified criterion that focuses on extremes of the risk distribution [33].

An example evaluation is shown in Table 3. Here we compare PRS across seven construction methods (lassosum, P&T, and five fixed P-value thresholds) that were all based on a single summary statistics source, a large GWAS on overall breast cancer [11]. In MGI, we observed that the lassosum-based PRS (44,815 SNPs) had the best performance (highest pseudo-R² = 0.057), the highest accuracy (Brier score = 0.137), the best discrimination between breast cancer cases and controls (AUC = 0.635 [95% confidence interval (CI): 0.624, 0.647]), and showed the strongest association with breast cancer itself (odds ratio [OR]_{continuous PRS} = 1.66 [95% CI: 1.58,1.73]). In this scenario, modeling LD information with lassosum retained more information than LD clumping [22], even though, unlike the other methods, lassosum only considered autosomal variants.

View this table:

Table 3.

Comparison of PRS methods on Breast Cancer PRS Performance in MGI (2,605 breast cancer cases and 12,548 controls). PRS are based on the BCAC Consortium GWAS on overall breast cancer [11]. Shaded cells indicate best performing PRS according to the corresponding metrics for MGI or UKB.

The enrichment of cases in the top 1% compared to the rest was more pronounced for the “GWAS hits” PRS (Fixed Threshold P <= 5e-08; 334 SNPs; OR_Top1% 3.77 [95% CI: 2.71,5.23]) than for the lassoum-PRS (OR_Top1% = 3.38 [95% CI: 2.42,4.71]). We found signs that the logistic regression model that we used for the evaluation might be misspecified for some PRS, e.g., four of the seven PRS were not well-calibrated according to the Goodness-of-Fit test statistics (Hosmer-Lemeshow P <= 0.05) (Table 3). In UKB, we observed an identical ranking of PRS methods but we noted several differences with MGI. First, the tuning parameters of the lassosum-PRS and the P&T-PRS differed between MGI and UKB, resulting in a different number of included variants (lassosum: MGI: 44,815 variants versus UKB: 286,144 variants; P&T: MGI 2,723 variants versus UKB: 1,682 variants) (Table 3). Closer inspection of the underlying tuning parameter optimization revealed comparable parameter ranking for lassosum and P&T, suggesting that optimizations seem cohort-specific but stable, i.e., a predictive PRS established in UKB might perform similarly well in MGI and vice versa (Spearman’s rank correlation > 0.98) (Figure S1).

Comparison across GWAS Sources

We also explored the influence of various GWAS sources on the predictive performance of PRS. As an illustrative example, we again focus on breast cancer PRS, but now consider PRS constructed from different breast cancer GWAS sources, using for each source the method that yielded the highest pseudo-R² (Table 4). In MGI, the PRS (lassosum) of the largest available GWAS (122,977 cases and 105,974 controls) yielded the best performance across most PRS metrics (e.g. pseudo-R² = 0.057, AUC = 0.635 [0.624,0.647]). The GWAS Catalog-PRS (P&T), which included 79 top hits from 18 different GWAS [11, 34–51], was ranked second (pseudo-R² = 0.034) and showed significantly inferior discrimination ability (AUC 0.603 [0.591,0.615]). The case enrichment in the top 1% was pronounced but not significantly different from the top-ranked PRS (OR_Top1%[GWAS Catalog] = 3.38 [2.42,4.71] versus OR_Top1%[Large GWAS] = 3.52 [2.52,4.89]). The four UKB GWAS-based PRS (all based on lassosum) followed next and showed similar performances (pseudo-R²: 0.029 – 0.022; AUC between 0.603 – 0.586 with overlapping confidence intervals) and could be ranked according to their effective sample sizes. Most interestingly, the “UKB PheCode” PRS (6,977 variants) could differentiate cases and controls as well as the GWAS Catalog PRS, which was based on 79 independent risk variants with P <= 2.5e-08 reported in 18 GWAS (both AUC 0.603 [0.591,0.615]). This suggested that biobank-based PRS can be a viable alternative for PRS construction, especially if summary statistics from a large disease-specific GWAS are unavailable (Table 4). A detailed comparison of GWAS sources across the 49 cancer traits in MGI is available in Table S3.

View this table:

Table 4.

Influence of GWAS sources on Breast Cancer PRS Performance in MGI. Shaded cells indicate best performing PRS according to the corresponding metrics for MGI or UKB.

Comparison of Performance Across Methods

First, we explored the benefit of p-value thresholding for the pre-filtered risk variants of the GWAS Catalog. Compared to the GWAS hits only approach, i.e., only perform LD-clumping of risk variants with P <= 5×10^-8, the P-value thresholding step of the P&T PRS construction improved PRS performance, as previously reported [52]. This implied that P-value thresholding might to be beneficial even for the relatively sparse sets of GWAS hits reported in the GWAS Catalog (Figure S2).

The P&T approach will, by definition, also cover fixed p-value thresholds in its tuning parameter optimization, i.e., the final P&T PRS will be based on the p-value threshold with the highest pseudo-R² and thus perform at least as well as any tested fixed p-value thresholds. Therefore, we limited our next comparison of PRS methods for full summary statistics to P&T and lassosum PRS. We assessed both methods for different GWAS sources in MGI (58 PRS) and UKB (12 PRS). We found that both methods ranked comparably, i.e., a GWAS source that produced a lassosum PRS with high pseudo-R² also produced a P&T PRS with high pseudo-R² and vice versa (Spearman’s rank correlation: rho > 0.937; (Figure S3).

Comparison of Performance across Cancers

Next, we were interested in comparisons between PRS across traits to assess overall performance and general differences between cancer traits. Table 5 shows the top-ranked PRS for the 20 most common cancer traits in MGI and highlights the different properties of the generated PRS. The PRS vary in their numbers of included SNPs and their abilities to distinguish cases from controls or to enrich cases in the top percentiles. The AUC of the presented PRS was highest for cancer of prostate PRS (AUC=0.664 [0.652, 0.676]) and lowest for the nodular lymphoma PRS (0.535 [95% CI:0.512, 0.559]). Significant enrichment of cases in the top 1% ranged from OR of 6.54 (95% CI: 4.41, 9.79; cancer of prostate) to 2.13 (95% CI: 1.03, 4.01; leukemia). Due to limited sample sizes in the top percentiles, we could not detect significant enrichment for most of the rarer cancers.

View this table:

Table 5:

Top PRS for the 20 most common cancer traits in MGI (see Table 1)

View this table:

Table 6:

Best performing PRS for the 19 cancer traits in UKB

Our observed variations between these cancer PRS likely recapitulates the different genetic architectures of cancers in combination with their prevalences in the discovery and evaluation cohorts. First, the prevalence impacted the ability to identify true associations in the discovery GWAS and also affected our capacity to observe significant effects in the PRS performance evaluation.

Comparison of Performance across Cohorts

The two evaluation cohorts, MGI and UKB, varied in, among other things, their sample sizes, their use of diagnosis code systems, and their recruitment mechanisms, with UKB representing a population-based cohort and MGI an EHR-based, cancer-enriched cohort. We limited a comparison of the cancer PRS to the top-ranked PRS for 19 cancers that were present for both cohorts. We selected the top PRS for each cancer within each cohort, i.e., their GWAS source and method might be different between MGI and UKB.

We noticed the same ranking of AUC values for most cancer PRS but found significantly higher estimates for cancer of brain, cancer of brain / nervous system, colorectal cancer, and prostate cancer in UKB than in MGI (Figure S4). The former two estimates might reflect the different underlying GWAS sources, while the latter two might be inflated in UKB due to overlapping samples between their discovery GWAS meta-analyses and the UKB cohort [12, 15]. The other 15 cancers showed a similar ranking of AUC estimates in both cohorts that ranged between “Cancer of other lymphoid, histiocytic tissue” (AUCMGI: 0.527, AUCUKB: 0.532) and highest for “chronic lymphoid leukemia” (AUCMGI: 0.682, AUCUKB: 0.690). AUC values tended to be slightly higher for UKB than for MGI, while confidence intervals were mostly smaller in UKB corresponding to their (often) larger observed effective sample sizes.

A similar comparison of the enrichment of cases in the top 10% versus bottom 90% revealed a lack of power for many cancer PRS in MGI with OR <= 1.3, but a relatively consistent ranking from PRS for pancreatic cancer (MGI OR_Top10%: 1.37 and UKB OR_Top10%: 1.64) to prostate cancer (MGI OR_Top10%: 3.65 and UKB OR_Top10%: 3.97). Overall enrichment effects were often stronger in UKB compared to MGI, reflecting the larger sample sizes of these cancers but also indicated a disparity between population- and hospital-based controls (Table 1 & Table S6; Figure S5). However, when comparing the enrichment of cases for two PRS that were well-powered in both cohorts (PRS for breast cancer and chronic lymphoid leukemia), we found it to be strikingly comparable across all tested percentiles (1, 2, 5, 10, and 25% versus rest; Figure 2).

Figure 2.

Distribution of breast cancer PRS (A, B), and chronic lymphoid leukemia (C, D) in matched case controls samples in MGI (A, C) and UKB (B, D). Enrichment of cases in five top PRS percentiles is indicated by the shaded areas under the density curves while corresponding Odds ratios (OR) are given in the top right corner of each plot. PRS were standardized.

Phenome-wide Association Analyses

Beyond case enrichment and risk stratification, PRS can also be used in phenome-wide screenings to uncover secondary trait associations through shared genetic risk factors [31, 32]. These secondary traits might uncover features in the EHR that occur before cancer diagnosis and thus could represent important predictors for cancer outcomes. From the generated PRS for 49 cancer traits, we selected 13 PRS in MGI and 18 PRS in UKB (whose association with their corresponding cancer traits reached phenome-wide significance) for phenome-wide screens of PRS associations. In total, we observed phenome-wide significant associations between 21 cancer PRS and 150 different secondary traits (Table S5). We performed “Exclusion-PRS-PheWAS” (i.e., removed primary cancer cases and repeated the phenome-wide analysis) to assess if the identified secondary associations were mainly driven by the primary cancer trait, e.g., through intensified screening or represent post-treatment effects [31]. While the exclusion of cases markedly decreased case counts of secondary traits, we still identified secondary traits that remained significantly associated with the corresponding cancer PRS (Table 7&8, Table S5). Most of the secondary traits in MGI that remained phenome-wide significant in the Exclusion-PRS-PheWAS, e.g., skin cancer PRS associated with actinic keratosis or thyroid cancer PRS associated with hypothyroidism, were reported in our previous studies [32, 53]. Due to the larger sample sizes for most traits in UKB compared to MGI (Table S6), we observed more and stronger secondary trait associations in UKB PRS-PheWAS. Several secondary trait associations were seen in both cohorts (e.g., hypothyroidism associated with thyroid cancer PRS after exclusion thyroid cancer cases: OR_MGI = 0.864 [0.838, 0.89] and OR_UKB = 0.896 [0.881,0.912]; Figure 3 A & C). We also observed several secondary trait associations exclusively in UKB. Some of these associations, e.g., hyperplasia of prostate associated with cancer of prostate PRS (Exclusion-PRS-PheWAS in UKB: OR 1.07 [95% CI: 1.05, 1.09], P = 2.16E-10), represent known risk factors or presentation features of primary cancers [54, 55]. However, we also observed traits where the cancer relevance was less clear, e.g., varicose veins associated with breast cancer PRS (Exclusion-PRS-PheWAS in UKB: OR 1.05 [95% CI: 1.03,1.07], P = 2.88E-07) (Figure 3 C & D; Table S5). Deeper explorations and replications are needed to understand these observed associations and to distinguish between spurious and genuine associations.

Figure 3

Exclusion PRS-PheWAS in the MGI and UKB phenomes. The horizontal line indicates phenome-wide significance. Only the strongest and phenome-wide significantly associated traits within a category are labelled. Directional triangles indicate whether a phenome-wide significant trait was positively (pointing up) or negatively (pointing down) associated with the PRS.

View this table:

Table 7.

Secondary traits associated with primary cancer PRS in MGI. Trait / PRS associations that reached phenome-wide significance (P <= 0.05/1679) in the PRS PheWAS in MGI after excluding the primary and related cancer traits from the analysis are shown

View this table:

Table 8.

Secondary traits associated with primary cancer PRS in UKB. PRS / trait associations that reached phenome-wide significance (P <= 0.05/1419) in the PRS PheWAS in UKB after excluding the primary and related cancer traits from the analysis are show.

Online Visual Catalog: Cancer PRSweb and R package Rprs

In our current study, we compared three PRS construction methods for 68 cancer traits using over 232 sets of GWAS summary statistics. By doing so, we created a large number of PRS in which predictive or enrichment properties differed between GWAS source, PRS method, and/or evaluation cohort. After assessing 1,292 constructed PRS, we found PRS for 49 different cancer traits that we deemed to have predictive value. In our explorations, we established that it could be beneficial to select PRS with certain predictive properties for a specific application instead of using one PRS for all applications. Also, it could be computationally more convenient to use a slightly less powerful PRS based on a fewer number of SNPs than to use a PRS that is based on a few hundred thousand variants. To allow the user the option to explore various PRS constructs, we created PRSweb (http://prsweb.sph.umich.edu), an interactive and intuitive web interface, to explore the available PRS constructs for 49 different cancer traits as well as their performance metrics and suitability for risk stratification, association studies, or other PRS applications.

After an initial selection menu for cancer trait and evaluation cohort (MGI or UKB) PRSweb provides tabularized information about all available PRS, their evaluation metrics (performance, discrimination, calibration, and accuracy) and case enrichment capabilities in five top percentiles of their distributions. The tables, similar to Tables 3 & 4, can be sorted, filtered, or downloaded in full. These tables contain detailed information about the underlying GWAS source(s), LD reference panels and are directly linked to downloadable PRS constructs. The PRS construct files include headers with information about the PRS construction (source, version, method, and references) and lists its underlying risk variants, their physical positions, effect/non-effect alleles (forward strand orientation for a given genome assembly), and its weights. Together with the “Rprs” R package (https://github.com/statgen/Rprs) we developed, the construct file will enable the reproduction of PRS association in MGI or UKB and allow a straightforward generation of comparable PRS in external datasets using imputed dosage data in VCF or BCF format.

For phenome-wide predictive PRS (association P_{PrimaryCancer} ≤ 0.05 / [# phenotypes in phenome]), PRSweb also links to PRS-PheWAS results for their evaluation cohort. The PRS-PheWAS result page includes interactive Manhattan plots for PRS-PheWAS and Exclusion-PRS-PheWAS with mouseover information for each tested association. The PheWAS plots can be exported as scalable vector graphics (SVG) and are accompanied by interactive and downloadable result tables that provide PheWAS summary statistics plus sample counts per analyzed phenotype.

We also implemented a search interface for each phenotype/PheCode to provide insights into the ICD-codes underlying the primary cancers as well as the traits of our EHR-derived MGI and UKB phenomes. A methods section describes the applied approaches.

Discussion

In our study, we constructed and evaluated a large set of cancer PRS using more than 200 different sources of GWAS summary statistics. We applied three common PRS construction methods: GWAS hits, LD pruning/P-value thresholding, and lassosum. While doing so, we created an online repository called “PRSweb” (http://prsweb.sph.umich.edu/) with over 600 PRS for 49 cancer traits.

We observed that construction and resulting performance of PRS depend on multiple factors, including GWAS source, PRS method, and evaluation cohort. Researchers who plan to apply PRS in their projects are often faced with an agony of choice from a set of PRS in the current literature or might not find predictive PRS at all. Furthermore, if PRS are available, a direct comparison of multiple constructs is often not feasible, as their performance can be cohort-specific and limited by available sample size.

To alleviate this situation, we generated “PRSweb” that could serve as a central hub for standardized PRS. PRSweb so far offered a selection and exploration of PRS based on publicly available cancer GWAS data. The platform integrated the evaluation of the rich EHR data of two independent biobanks, MGI and UKB. In our initial version of PRSweb, we focused on cancer traits because MGI is enriched for cancer.

There are several remaining challenges in developing PRS, and we will discuss the following four: access to independent GWAS summary statistics, mapping of trait definitions between discovery and evaluation cohorts, power limitations, and finally, transferability of PRS across cohorts and ancestries.

Access to independent GWAS summary statistics

Limited accessibility to full summary statistics for cancer GWAS in the published literature resulted in a lack of PRS constructs for many cancers. By systematically integrating openly available cancer GWAS summary statistics, we can also openly share PRS constructs, some with millions of markers, with the research community. However, there are large cancer GWAS datasets used in the cancer research community that are not yet integrated into PRSweb. For example, a recent study analyzed fourteen different cancer types based on summary-level association statistics from larger cancer GWAS consortia [56]. To our knowledge, only the full summary statistics on breast cancer [11], ovarian cancer [9], and prostate cancer [12] were openly shared. We are confident that future versions of PRSweb will be able to integrate summary statistics from other large GWAS consortia, e.g., on chronic lymphocytic leukemia, glioma, melanoma, esophageal, testicular, oropharyngeal, pancreatic, renal, colorectal, endometrial, or lung cancer, some with substantially larger samples sizes than the GWAS used in our current analysis.

With the tendency to form large consortia and to integrate available biobank data comes another challenge, namely the potential overlap between the discovery and evaluation cohorts and, thus, potential overfitting. For our current study, we used GWAS that are to the best of our knowledge, independent from MGI. Since UKB is a popular and widely used resource, the assumption of independence of large GWAS efforts from UKB, does not always hold true as we have seen for the large colorectal cancer GWAS[15]. In the future, the assessment of independence of GWAS from PRS construction will become more challenging, especially when relying on GWAS databases (e.g., the GWAS Catalog), where the distinction of contributing cohorts might not be obvious from a database entry alone. An alternative solution, especially for consortia joining large GWAS, is leave-one-out meta-analysis where in addition to the full meta-analysis results, a separate set of meta-analysis results will be provided for each contributing cohort so that each resulting leave-one-out meta-analysis can be shared and used for PRS generation in that cohort to avoid overfitting.

We anticipate a more accessible landscape of high-quality full GWAS results in the near future, not only for cancer. First, funding agencies are updating their policy regarding access to GWAS summary statistics of funded projects, e.g., the US National Institutes of Health (NIH), (https://grants.nih.gov/grants/guide/notice-files/NOT-OD-19-023.html). Secondly, biobank studies are growing in numbers and size and, when connected to EHR data, enable GWAS for thousands of traits each [53]. In addition, global efforts are forming that will enable even more powerful phenome x GWAS meta-analyses through collaboration, likely reaching sample sizes that can compete with classical disease-specific consortia [57].

Mapping of Trait Definitions

One of the premises for PRS utility is the resemblance of the original trait in the discovery GWAS with the trait of the evaluation cohort. For our current study, we relied on EHR-based cohorts and defined cancer via PheWAS codes that are adopted from ICD codes. It is important to bear in mind that we used EHR-based diagnosis data that per se were not collected for research. Besides misclassification, EHR-derived phenomes might be prone to selection and recruitment biases that can negatively impact power or result in false-positive associations [53]. ICD codes usually serve administrative and billing purposes and often lack the specificity found in the discovery GWAS. Due to the difference in trait definitions, we often had to fall back to the broad phenotype definition in the EHR cohorts and, by doing so, might have negatively influenced the predictive power for PRS [58]. For example, we only had one definition for ovarian cancer in MGI and UKB (Phecode 184.11 “Malignant neoplasm of ovary”) that was defined by ICD9 codes 183.0 and V10.43 as well as by ICD10:C56 and their sub-codes. In contrast, the large GWAS on ovarian cancer included results for nine more refined cancer subtypes: invasive epithelial, low-grade serous, high-grade serous, serous invasive, endometrioid, epithelial, mucinous, low-grade serous and serous borderline ovarian cancer, and ovarian clear cell cancer. For our PRS generation, we used all nine GWAS as separate sources and tested each resulting PRS against the single PheWAS code 184.11. Consequently, the best performing PRS might represent the combination where the discovery GWAS’s trait specificity and the cohort’s trait composition maximized predictive power.

While we were bound to PheWAS code definitions, future PRS explorations and evaluations of growing EHR data should include more refined cancer phenotypes by integrating cancer registry data and/or natural language processing of clinical notes. Still, the chosen phenotype definitions represent valid and common cancer groupings that are frequently used in clinical and research applications [59].

Power Limitations

For our project, we used data from MGI, a medical center-based cohort, and UKB, a population-based cohort. Due to MGI’s recruitment mechanism through surgery, the observed case counts in MGI reflected the numbers of adult (18+) patients that underwent a surgical procedure and had at least one corresponding cancer diagnosis in their medical records. The case counts in UKB, a rather healthy subset of the older (40-69) British population [60], might be even lower than the population’s cancer prevalence. We observed an enrichment of many cancers in MGI compared to UKB, especially for rarer cancers like thyroid cancer, but generally registered more cases in the UKB because its cohort is ten times larger (Table S6). In addition, MGI’s recruitment through surgical procedures likely resulted in a relative depletion of blood cancers (e.g., leukemia, lymphoma, and myeloma), since affected patients undergo surgery less frequently than somatic cancer patients. As a consequence, we often had sufficient power to evaluate and analyze PRS for these diseases in UKB but not MGI.

One may be interested in defining a combined phenotype of “any cancer” for a composite cancer PRS with a maximal sample size. We defined this phenotype in UKB (with 69,190 cases of any cancer; Tables S6 & S7, Figures S7 & S8), performed a GWAS that revealed known risk variants for numerous cancers, and created an “any cancer” PRS using our established methods. The lassosum PRS with a choice for 179 variants performed best amongst the constructs (Table S8). However, while defining such a composite phenotype, we have to remember that the endpoint is a heterogeneous mix of various cancers, and the discovery will be driven by the cancers with larger numbers of cases or strong risk effects in the discovery (UKB) and evaluation (MGI) cohort. In the PRS PheWAS in MGI, we saw many related traits associated with the overall PRS. No secondary trait reached phenome-wide significance in the exclusion PRS-PheWAS (Table S9; Figure S9). We incorporated this overall PRS constructs in Cancer-PRSweb.

Besides accessible sample sizes, the ability to create predictive PRS depends on the cancers’ “chip heritability,” i.e., the variance explained through polygenic variants of genotyped and imputed datasets. A previous study on six common cancers found that chip heritability estimates can vary substantially for cancers (e.g., estimated heritability for prostate cancer: 27%, breast cancer 12%, and pancreatic cancer 7%) [61]. Thus, indicating that even if the most powerful cancer PRS can be generated, other factors play a bigger role, emphasizing the limitations of PRS for personal risk prediction if used on its own without considering other risk factors [62].

Also, genetic architecture affects the choice of PRS construction methods. A recent study estimated the heritability explained by genome-wide significant variants for 14 common cancers and found a wide variability of explained heritability estimates among the analyzed cancer types. For some cancers like testicular cancer, chronic lymphocytic leukemia, prostate, and breast cancer, GWAS hits could explain a large fraction of the chip heritability, while GWAS hits for other cancers like esophageal, colorectal, endometrial, ovarian or lung cancer explained only moderate to very low fractions [56]. Consequently, approaches that only consider GWAS hits might work better for the former, while less conservative p-value thresholds or genome-wide PRS methods might work better for the latter cancer traits.

Transferability of PRS across cohorts

In our current study, we constructed and evaluated PRS in individuals of broadly European ancestry. However, we recognize the need to also construct and share PRS for non-European ancestry groups, especially because of the limited transferability of PRS across ancestries and ethnicities [7]. The integration of PRS for non-European individuals into our platform PRSweb so far is hampered by the scarcity of GWAS data for diverse ancestry groups [63], and by the limited diversity in MGI and UKB, both encompassing predominantly European ancestry individuals.

Differences in genotyping and sequencing strategies can also negatively impact comparability between studies. Ideally, genotype data in the discovery GWAS, the LD reference panel, and the evaluation cohort should be comparable in quality, density and LD structure for ultimate compatibility. GWAS usually rely on genotyping arrays that can differ in composition and density of variants. Phasing and imputation methods are constantly improving thanks to growing reference panels and refined methods [64] and are essential in harmonizing genotype data across cohorts. However, the achievable accuracy is dependent on the study’s sample size and variant density. Consequently, a PRS that was constructed from a large and marker-dense GWAS might not be directly transferable to smaller, sparser genotype data.

In our current analysis of two genotype datasets that differed in genotype density and sample size, we found that the tuning parameters of PRS established separately for MGI and UKB were ranked similarly in terms of their resulting predictive performance. This indicated that sharing of PRS constructs might represent a feasible and convenient alternative to computationally expensive PRS methods and evaluations.

Conclusions

By generating PRS from a large collection of freely available cancer GWAS summary statistics and by evaluating them in two independent biobanks, we created the analytical backbone of PRSweb, an online repository for cancer PRS offering detailed constructs and comparisons. We designed PRSweb for transparency and convenience to democratize PRS research. So far, we included PRS constructs and analyses for 49 different cancer traits that showed promising performance in MGI and/or UKB. We anticipate the inclusion of additional PRS constructs and methods in an upcoming version of PRSweb that also will expand our focus beyond cancers.

Several challenges remain in PRS research in terms of access, power, and transferability. Nevertheless, PRS have proven to be a valuable tool for risk stratification, especially if combined with non-genetic risk factors [65–67]. PRS will likely become more powerful with growing sample sizes, better tools, and more diverse resources.

Methods

Evaluation cohorts

MGI cohort

Adult (18+) participants were recruited through the Michigan Medicine health system while awaiting diagnostic or interventional procedures either during a preoperative visit prior to the procedure or on the day of the procedure that required anesthesia. In addition to coded biosamples and secure, protected health information, participants understood that all EHR, claims, and national data sources linkable to the participant may be incorporated into the MGI databank. Each participant donated a blood sample for genetic analysis, underwent baseline vital sign testing, and completed a comprehensive history and physical assessment (also see Ethics Statement below). We report results obtained from 38,360 unrelated, genotyped patients of inferred recent European ancestry with available integrated EHR data (∼90 % of all MGI participants were inferred to be of recent European ancestry) [68]. The data used in this study included diagnoses coded with the Ninth and Tenth Revision of the International Statistical Classification of Diseases (ICD9 and ICD10) with clinical modifications (ICD9-CM and ICD10-CM), sex, precomputed principal components (PCs), genotyping batch, and age. Data were collected according to the Declaration of Helsinki principles [69]. MGI study participants’ consent forms and protocols were reviewed and approved by the University of Michigan Medical School Institutional Review Board (IRB ID HUM00099605 and HUM00155849). Opt-in written informed consent was obtained.

UK Biobank cohort (UKB)

UKB is a population-based cohort collected from multiple sites across the United Kingdom and includes over 500,000 participants aged between 40 and 69 years when recruited in 2006–2010 [16]. The open-access UK Biobank data used in this study included genotypes, ICD9 and ICD10 codes, inferred sex, inferred White British ancestry, kinship estimates down to third degree, birthyear, genotype array, and precomputed principal components of the genotypes. Table 1 provides some descriptive statistics of the MGI and UK Biobank samples.

Genotyping, sample quality control and imputation

MGI

DNA from 47,364 blood samples was genotyped on customized Illumina Infinium CoreExome-24 bead arrays and subjected to various quality control filters, resulting in a set of 392,323 polymorphic variants. Principal components and ancestry were estimated by projecting all genotyped samples into the space of the principal components of the Human Genome Diversity Project reference panel using PLINK (938 individuals) [70, 71]. Pairwise kinship was assessed with the software KING [72], and the software fastindep was used to reduce the data to a maximal subset that contained no pairs of individuals with 3rd-or closer degree relationship [73]. We removed participants without EHR data and participants not of recent European descent from the analysis, resulting in a final sample of 38,360 unrelated subjects. Additional genotypes were obtained using the Haplotype Reference Consortium reference panel of the Michigan Imputation Server [58] and included over 24 million imputed variants with R² ≥0.3 and minor allele frequency (MAF) ≥0.01%. Genotyping, quality control, and imputation are described in detail elsewhere [68].

UK Biobank

We used the UK BioBank Imputed Dataset (v3, https://www.ebi.ac.uk/ega/datasets/EGAD00010001474) and limited analyses to the documented 408,961 White British [74] individuals and 47,836,001 variants with imputation information score >= 0.3 and MAF >= 0.01% of which 22,846,729 overlapped with the imputed MGI data (see above). Two random subsets of 5,000 and 10,000 unrelated, White British individuals were used for LD analyses of UKB-based summary statistics.

Phenome generation

MGI

The MGI phenome was used as the discovery dataset and was based on ICD9-CM and ICD10-CM code data for 38,360 unrelated, genotyped individuals of recent European ancestry. Longitudinal time-stamped diagnoses were recoded to indicators for whether a patient ever had given a diagnosis code recorded by Michigan Medicine. These ICD9-CM and ICD10-CM codes were aggregated to form up to 1,857 PheWAS traits using the PheWAS R package (as described in detail elsewhere [68, 75]). For each trait, we identified case and control samples. To minimize differences in age and sex distributions, avoid extreme case-control ratios, and reduce the computational burden, we matched up to 10 controls to each case using the R package “MatchIt” [76]. Nearest neighbor matching was applied for age and the first four principal components of the genotype data (PC1-4) using Mahalanobis distance with a caliper/width of 0.25 standard deviations. Exact matching was applied for sex and genotyping array. A total of 1,689 case-control studies with >50 cases were used for our analyses of the MGI phenome.

UK Biobank

The UK Biobank phenome was used as a replication dataset and was based on ICD9 and ICD10 code data of 408,961 White British [74], genotyped individuals that were similarly aggregated to PheWAS traits as MGI (as described elsewhere [77]). In contrast to MGI, there were many pairwise relationships reported for UKB participants.

To retain a larger effective sample size for each phenotype, we first selected a maximal set of unrelated cases for each phenotype (defined as no pairwise relationship of 3^rd degree or closer [11, 73]) before selecting a maximal set of unrelated controls unrelated to these cases. Similar to MGI, we matched up to 10 controls to each case using the R package “MatchIt” [76]. Nearest neighbor matching was applied for birthyear and PC1-4 (Mahalanobis-metric matching; matching window caliper/width of 0.25 standard deviations), and exact matching was applied for sex and genotyping array. A total of 1,419 case-control studies with >50 cases each were used for our analyses of the UK Biobank phenome.

On average, we were able to match 9 controls per case in the MGI phenome and 9.9 controls per case in the UKB phenome. Additional phenotype information for MGI and UK Biobank is included in (Table S6).

PRS Structure

PRS combine information across a defined set of genetic loci, incorporating each locus’s association with the target trait. The PRS for patient j takes the form PRS_j=∑_iβ_iG_ij where i indexes the included loci for that trait, weight βi is the log odds ratios retrieved from the external GWAS summary statistics for locus i, and G_ij is a continuous version of the measured dosage data for the risk allele on locus i in subject j. In order to construct a PRS, one must determine which genetic loci to include in the PRS and their relative weights. Below, we obtain GWAS summary statistics from several different sources, resulting in several sets of weights for each trait of interest. For each set of weights, we consider several strategies for determining which genetic loci to include in the PRS construction.

Sources of GWAS summary statistics

For each of 68 cancers of interest, we collected GWAS summary statistics from up to three different sources: (1) merged genome-wide significant association signals published in the NHGRI EBI GWAS Catalog [78] if available; (2) large cancer GWAS meta-analysis if available; and (3) publicly available GWAS summary statistics of phenome x genome screening efforts of the UK Biobank data [77] (see Web Resources; Figure 1). If needed, we used LiftOver to convert coordinates of GWAS summary statistics to human genome assembly GRCh37 (https://genome-store.ucsc.edu/).

GWAS Catalog

We downloaded previously reported GWAS variants from the NHGRI-EBI GWAS Catalog (file version: r2019-05-03, https://www.ebi.ac.uk/gwas/) [78, 79]. Single nucleotide polymorphism (SNP) positions were converted to GRCh37 using variant IDs from dbSNP (build 151; UCSC Genome Browser, http://genome.ucsc.edu/) after updating outdated dbSNP IDs to their merged dbSNP IDs.

Entries with missing risk alleles, risk allele frequencies, or SNP-disease odds ratios were excluded. If a reported risk allele did not match any of the reported forward strand alleles of a non-ambiguous SNP (not A/T or C/G) in the imputed MGI genotype data (which correspond to the alleles of the imputation reference panel), we assumed minus-strand designation and corrected the effect allele to its complementary base of the forward strand. Entries with a reported risk allele that did not match any of the alleles of an ambiguous SNP (A/T and C/G) in our data were excluded at this step. We only included entries with broad European ancestry (as reported by the NHGRI-EBI GWAS Catalog) to match ancestries of discovery GWAS and target cohorts (MGI and UKB). As a quality control check, we compared the GWAS Catalog reported risk allele frequencies (RAF) with the RAF in MGI individuals. We then excluded entries whose RAF deviated more than 15%. This chosen threshold is subjective and was based on clear differentiation between correct and likely flipped alleles on the two diagonals (Figure S6), as noted frequently in GWAS meta-analyses quality control procedures [80]. For SNPs with multiple entries, we kept the SNP with the most recent publication date (and smaller p-value, if necessary) and excluded the others.

Large GWAS meta-analyses

We downloaded full GWAS summary statistics made available by the “Breast Cancer Association Consortium” (BCAC) [11], the “Prostate Cancer Association Group to Investigate Cancer Associated Alterations in the Genome” (PRACTICAL) [12], and the “Ovarian Cancer Association Consortium” (OCAC) [9]. In addition, we extracted partial GWAS summary statistics that accompanied recent publications but were incomplete, i.e. reporting only SNPs below a certain p-value threshold [15, 81–83]. GWAS summary statistics were harmonized and, if needed, lifted over to human genome assembly GRCh37. In this paper, this source is referred to as “Large GWAS”.

UK-Biobank-based GWAS

We downloaded UK Biobank based GWAS summary statistics from two public repositories.

The first set of UK Biobank GWAS summary statistics were based on the analysis of up to 408,961 White British European-ancestry samples (https://www.leelabsg.org/resources). SNP-disease odds ratios were estimated using logistic mixed modeling adjusting for sample relatedness, and p-values were estimated using saddlepoint approximations (SAIGE method [17]) to calibrate the distribution of score test statistics and, thus, control for unbalanced case-control ratios. The underlying phenotypes were auto-curated phenotypes based on the PheCodes of the PheWAS R package [68, 75, 77] similar to the phenomes used in our study and in the following are referred to as “UKB PHECODE” (Table S1).

The second set of UK Biobank GWAS summary statistics were based on a linear regression model of up to 361,194 unrelated White British samples adjusting for relevant covariates (https://github.com/Nealelab/UK_Biobank_GWAS). Three phenotype models were used in their analyses: (1) “PHESANT”: auto-curated phenotypes using PHEnome Scan ANalysis Tool (https://github.com/MRCIEU/PHESANT), (2) “ICD10”: individuals with the same ICD10 category code (first three characters, e.g. “C50”) were used as cases while all non-coded individuals were treated as controls, and (3) “FINNGEN”: curated phenotypes / endpoints based on definitions of the Finngen consortium (https://www.finngen.fi/en/researchers/clinical-endpoints). In addition to the “UKB PHECODE” (described above), these three latter sources are referred to as “UKB PHESANT”, “UKB ICD10” and “UKB FINNGEN”, respectively (Table S1).

PRS Construction

For each set of GWAS summary statistics from the above-mentioned sources and each cancer, we develop up to seven different PRS using three different construction methods (Figure 1). Our goal of this approach was to compare multiple PRS methods and find the method that works best for the various types of GWAS summary statistics.

For the first two construction strategies, we performed LD clumping/pruning of variants with p-values below 10^-4 by using the imputed allele dosages of 10,000 randomly selected samples and a pairwise correlation cut-off at r² < 0.1 within 1Mb window. Using the resulting loci, we defined up to five sub-sets of variants with p-values below different thresholds (<5×10^-9 to <5×10^-5). These were used to construct a PRS tied to each threshold, where the PRS associated with p-values less than 5×10^-8 is sometimes denoted as “GWAS hits.” For the second PRS construction method, we construct many different PRS across a fine grid of p-value thresholds. The p-value threshold with the highest cross-validated pseudo-R2 (see PRS Evaluation below) was used to define the more optimized “Pruning and Thresholding (P & T)” PRS.

As an alternative to the p-value thresholding and “P&T” PRS construction strategies, we also used the software package “lassosum” [22] to define a third type of PRS for GWAS sources with full summary statistics. Lassosum obtains PRS weights by applying elastic net penalization to GWAS summary statistics and incorporating LD information from a reference panel. Here, we used 5,000 randomly selected, unrelated samples as the LD reference panel. We applied a MAF filter of 1 % and, in contrast to the other two approaches, only included autosomal variants that overlap between summary statistics, LD reference panel, and target panel. Each “lassosum” run resulted in up to 76 combinations of the elastic net tuning parameters s and λ, and consequently, in 76 SNP sets with corresponding weights used to construct 76 PRS. We then selected the PRS with the highest cross-validated pseudo-R² to define the “lassosum” PRS.

For each cancer and set of GWAS summary statistics, this approach resulted in up to seven PRS, where PRS with less than 5 included variants were excluded and the available GWAS summary statistics limited the available PRS construction techniques in some cases. Using the R package “Rprs” (https://github.com/statgen/Rprs), the value of each PRS was then calculated for each MGI participant and, if the GWAS source was not based on UKB, also for each UKB participant. For comparability of association effect sizes corresponding to the continuous PRS across cancer traits and PRS construction methods, we centered PRS values in MGI and UKB to their mean and scaled them to have a standard deviation of 1.

PRS Evaluation

For the PRS evaluations, we fit the following model for each PRS and cancer phenotype without adjusting for covariates: We performed a 5-fold cross validation with the R package “caret” [84] to obtain fitted predictors for the actual PRS evaluations. We used Nagelkerke’s pseudo-R² [85] to select the tuning parameters within the “P&T” and lassosum construction methods (P-value for “P&T” SNP sets; s and λ for lassosum) and kept the PRS with the highest pseudo-R² for further analyses. For each PRS derived for each GWAS source/method combination, we assessed the following performance measures relative to observed disease status in MGI and UKB:

(1) overall performance with Nagelkerke’s pseudo-R² using R packages “rcompanion” [85], (2) accuracy with Brier score using R package “DescTools” [86]; (3) ability to discriminate between cases and controls as measured by the area under the receiver operating characteristic (ROC) curve (denoted AUC) using R package “pROC” [87] and (4) calibration using Hosmer-Lemeshow Goodness of Fit test in the R package “ResourceSelection” [88–90].

PRS Association Testing

Next, we assessed the strength of the relationship between these PRS and the traits they were designed for. To do this we fit the following model for each PRS and cancer phenotype adjusting for various covariates: where the PCs were the first four principal components obtained from the principal component analysis of the genotyped GWAS markers, where “Age” was the age at last observed diagnosis in MGI and birthyear in UKB and where “Array” represents the genotyping array. Our primary interest is β_PRS, while the other factors (Age, Sex and PC) were included to address potential residual confounding and do not provide interpretable estimates due to the preceding application of case-control matching. Firth’s bias reduction method was used to resolve the problem of separation in logistic regression (Logistf in R package “EHR”) [91–93].

To study the ability of the PRS to identify high risk patients, we fit the above model but replacing the PRS with an indicator for whether the PRS value was in the top 1, 2, 5, 10, or 25% among the matched case control cohort.

Phenome-wide Exploration of PRS Associations

We selected PRS that were strongly associated with the cancer trait they were designed for phenome-wide association exploration in the phenomes of MGI and UKB for (p-value ≤ (0.05 / [#phenotypes in corresponding phenome]); see below).

We conducted PheWAS in MGI and also UKB (if the GWAS source was not based on UKB) to identify additional, secondary phenotypes associated with the PRS [31]. To evaluate PRS-phenotype associations, we conducted Firth bias-corrected logistic regression by fitting model of equation 1 above for each PRS and each phenotype of the corresponding phenome. To adjust for multiple testing, we applied the conservative phenome-wide Bonferroni correction according to the total number of analyzed PheWAS codes (MGI: 1,689 phenotypes; UKB: 1,419 phenotypes). In Manhattan plots, we present –log10 (p-value) corresponding to tests of H₀: β_PRS = 0. Directional triangles on the PheWAS plot indicate whether a phenome-wide significant trait was positively (pointing up) or negatively (pointing down) associated with the PRS.

To investigate the possibility of the secondary trait associations with PRS being completely driven by the primary trait association, we performed a second set of PheWAS after excluding individuals affected with the primary or related cancer traits for which the PRS was constructed, referred to as “Exclusion-PRS-PheWAS” as described previously [68].

Online Visual Catalog: PRSweb

The online open access visual catalog PRSweb was implemented using Grails, a Groovy- and Java-based backend logic, to integrate interactive visualizations and MySQL databases. Interactive PheWAS plots are drawn with the JavaScript library “LocusZoom.js” which is maintained by the UM Center for Statistical Genetics (https://github.com/statgen/locuszoom) and offers dynamic plotting, automatic plot sizing, and label positioning. Additional data-driven visualizations (e.g. temporal relationship plots) were implemented with the JavaScript library “D3.js”.

Unless otherwise stated, analyses were performed using R 3.6.1 [94].

Data availability

Data cannot be shared publicly due to patient confidentiality. The data underlying the results presented in the study are available from University of Michigan Medical School Central Biorepository at https://research.medicine.umich.edu/our-units/central-biorepository/get-access and from the UK Biobank at http://www.ukbiobank.ac.uk/register-apply/ for researchers who meet the criteria for access to confidential data.

Author contributions

Conceptualization: L.G.F., B.M.; Data Curation: L.G.F.; Formal Analysis: L.G.F., R.B.P.; Funding Acquisition: L.G.F., B.M.; Investigation: L.G.F.; Methodology: L.G.F., L.J.B., B.M.; Project Administration: L.G.F., B.M.; Resources: L.G.F., B.M.; Software: L.G.F., S.P., L.J.B., P.V., M.S., R.B.P., D.T.; Supervision: L.G.F., B.M.; Visualization: L.G.F., S.P., P.V.; Writing – Original Draft Preparation: L.G.F., B.M.; Writing – Review & Editing: L.G.F., S.P., L.J.B., P.V., M.S., R.B.P., D.T., X.Z., B.M.

Competing interests

The authors declare no competing interests.

Materials & Correspondence

larsf{at}umich.edu (L.G.F.), bhramar{at}umich.edu (B.M.)

Acknowledgements

The authors acknowledge the Michigan Genomics Initiative participants, Precision Health at the University of Michigan, the University of Michigan Medical School Data Office for Clinical and Translational Research, the University of Michigan Medical School Central Biorepository, and the University of Michigan Advanced Genomics Core for providing data storage, management, processing, and distribution services, and the Center for Statistical Genetics in the Department of Biostatistics at the School of Public Health for genotype data curation, imputation, and management in support of the research reported in this publication/grant application/presentation. Part of this research has been conducted using the UK Biobank Resource under application number 24460.

This material is based in part upon work supported by the National Institutes of Health/NIH (NCI P30CA046592) and by the National Science Foundation under grant number DMS-1712933. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Footnotes

Added missing acknowledgement of the UK Biobank data: Part of this research has been conducted using the UK Biobank Resource under application number 24460.
http://prsweb.sph.umich.edu

Reference

1.↵
Klein, R.J., et al., Complement factor H polymorphism in age-related macular degeneration. Science, 2005. 308(5720): p. 385–9.
OpenUrl Abstract/FREE Full Text
2.↵
Buniello, A., et al., The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res, 2019. 47(D1): p. D1005–D1012.
OpenUrl CrossRef PubMed
3.↵
Visscher, P.M., et al., 10 Years of GWAS Discovery: Biology, Function, and Translation. Am J Hum Genet, 2017. 101(1): p. 5–22.
OpenUrl CrossRef PubMed
4.↵
Khera, A.V., et al., Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat Genet, 2018. 50(9): p. 1219–1224.
OpenUrl CrossRef PubMed
5.
Dudbridge, F., Power and predictive accuracy of polygenic risk scores. PLoS Genet, 2013. 9(3): p. e1003348.
OpenUrl CrossRef PubMed
6.↵
Chatterjee, N., et al., Projecting the performance of risk prediction based on polygenic analyses of genome-wide association studies. Nat Genet, 2013. 45(4): p. 400–5, 405e1-3.
OpenUrl CrossRef PubMed
7.↵
Martin, A.R., et al., Human Demographic History Impacts Genetic Risk Prediction across Diverse Populations. Am J Hum Genet, 2017. 100(4): p. 635–649.
OpenUrl CrossRef PubMed
8.↵
Lawrenson, K., et al., Genome-wide association studies identify susceptibility loci for epithelial ovarian cancer in east Asian women. Gynecol Oncol, 2019. 153(2): p. 343–355.
OpenUrl
9.↵
Phelan, C.M., et al., Identification of 12 new susceptibility loci for different histotypes of epithelial ovarian cancer. Nat Genet, 2017. 49(5): p. 680–691.
OpenUrl CrossRef
10.↵
Lee, J.Y., et al., BRCA1/2-negative, high-risk breast cancers (BRCAX) for Asian women: genetic susceptibility loci and their potential impacts. Sci Rep, 2018. 8(1): p. 15263.
OpenUrl
11.↵
Michailidou, K., et al., Association analysis identifies 65 new breast cancer risk loci. Nature, 2017. 551(7678): p. 92–94.
OpenUrl
12.↵
Schumacher, F.R., et al., Association analyses of more than 140,000 men identify 63 new prostate cancer susceptibility loci. Nat Genet, 2018. 50(7): p. 928–936.
OpenUrl
13.↵
Tanikawa, C., et al., GWAS identifies two novel colorectal cancer loci at 16q24.1 and 20q13.12. Carcinogenesis, 2018. 39(5): p. 652–660.
OpenUrl
14.↵
Leo, P.J., et al., Defining the genetic susceptibility to cervical neoplasia-A genome-wide association study. PLoS Genet, 2017. 13(8): p. e1006866.
OpenUrl CrossRef PubMed
15.↵
Huyghe, J.R., et al., Discovery of common and rare genetic risk variants for colorectal cancer. Nat Genet, 2019. 51(1): p. 76–87.
OpenUrl CrossRef
16.↵
Sudlow, C., et al., UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med, 2015. 12(3): p. e1001779.
OpenUrl CrossRef PubMed
17.↵
Zhou, W., et al., Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nature Genetics, 2018. 50(9): p. 1335–1341.
OpenUrl CrossRef PubMed
18.↵
Millard, L.A.C., et al., Software Application Profile: PHESANT: a tool for performing automated phenome scans in UK Biobank. Int J Epidemiol, 2017.
19.↵
Wu, P., et al., Developing and Evaluating Mappings of ICD-10 and ICD-10-CM Codes to PheCodes. bioRxiv, 2019: p. 462077.
20.↵
Shi, X., H. Pashova, and P.J. Heagerty, Comparing healthcare utilization patterns via global differences in the endorsement of current procedural terminology codes. Ann. Appl. Stat., 2017. 11(3): p. 1349–1374.
OpenUrl
21.↵
Vilhjalmsson, B.J., et al., Modeling Linkage Disequilibrium Increases Accuracy of Polygenic Risk Scores. Am J Hum Genet, 2015. 97(4): p. 576–92.
OpenUrl CrossRef PubMed
22.↵
Mak, T.S.H., et al., Polygenic scores via penalized regression on summary statistics. Genet Epidemiol, 2017. 41(6): p. 469–480.
OpenUrl
23.
Lloyd-Jones, L.R., et al., Improved polygenic prediction by Bayesian multiple regression on summary statistics. 2019: p. 522961.
24.↵
Ge, T., et al., Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat Commun, 2019. 10(1): p. 1776.
OpenUrl
25.↵
Zhou, X., P. Carbonetto, and M. Stephens, Polygenic modeling with bayesian sparse linear mixed models. PLoS Genet, 2013. 9(2): p. e1003264.
OpenUrl CrossRef PubMed
26.↵
Prive, F., et al., Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr. Bioinformatics, 2018. 34(16): p. 2781–2787.
OpenUrl
27.↵
Prive, F., H. Aschard, and M.G.B. Blum, Efficient Implementation of Penalized Regression for Genetic Risk Prediction. Genetics, 2019. 212(1): p. 65–74.
OpenUrl Abstract/FREE Full Text
28.↵
Mavaddat, N., et al., Polygenic Risk Scores for Prediction of Breast Cancer and Breast Cancer Subtypes. Am J Hum Genet, 2019. 104(1): p. 21–34.
OpenUrl CrossRef PubMed
29.
Du, Z., et al., A genome-wide association study of prostate cancer in Latinos. Int J Cancer, 2019.
30.↵
Shieh, Y., et al., A polygenic risk score for breast cancer in U.S. Latinas and Latin-American women. J Natl Cancer Inst, 2019.
31.↵
Fritsche, L.G., et al., Association of Polygenic Risk Scores for Multiple Cancers in a Phenome-wide Study: Results from The Michigan Genomics Initiative. American Journal of Human Genetics, 2018. 102(6): p. 1048–1061.
OpenUrl
32.↵
Fritsche, L.G., et al., Exploring various polygenic risk scores for skin cancer in the phenomes of the Michigan genomics initiative and the UK Biobank with a visual catalog: PRSWeb. PLoS Genet, 2019. 15(6): p. e1008202.
OpenUrl
33.↵
Song, M., et al., Testing calibration of risk models at extremes of disease risk. Biostatistics, 2015. 16(1): p. 143–54.
OpenUrl CrossRef PubMed
34.↵
Easton, D.F., et al., Genome-wide association study identifies novel breast cancer susceptibility loci. Nature, 2007. 447(7148): p. 1087–93.
OpenUrl CrossRef PubMed Web of Science
35.
Hunter, D.J., et al., A genome-wide association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer. Nat Genet, 2007. 39(7): p. 870–4.
OpenUrl CrossRef PubMed Web of Science
36.
Stacey, S.N., et al., Common variants on chromosomes 2q35 and 16q12 confer susceptibility to estrogen receptor-positive breast cancer. Nat Genet, 2007. 39(7): p. 865–9.
OpenUrl CrossRef PubMed
37.
Gold, B., et al., Genome-wide association study provides evidence for a breast cancer risk locus at 6q22.33. Proc Natl Acad Sci U S A, 2008. 105(11): p. 4340–5.
OpenUrl Abstract/FREE Full Text
38.
Ahmed, S., et al., Newly discovered breast cancer susceptibility loci on 3p24 and 17q23.2. Nat Genet, 2009. 41(5): p. 585–90.
OpenUrl CrossRef PubMed Web of Science
39.
Thomas, G., et al., A multistage genome-wide association study in breast cancer identifies two new risk alleles at 1p11.2 and 14q24.1 (RAD51L1). Nat Genet, 2009. 41(5): p. 579–84.
OpenUrl CrossRef PubMed Web of Science
40.
Turnbull, C., et al., Genome-wide association study identifies five new breast cancer susceptibility loci. Nat Genet, 2010. 42(6): p. 504–7.
OpenUrl CrossRef PubMed Web of Science
41.
Antoniou, A.C., et al., A locus on 19p13 modifies risk of breast cancer in BRCA1 mutation carriers and is associated with hormone receptor-negative breast cancer in the general population. Nat Genet, 2010. 42(10): p. 885–92.
OpenUrl CrossRef PubMed Web of Science
42.
Li, J., et al., A combined analysis of genome-wide association studies in breast cancer. Breast Cancer Res Treat, 2011. 126(3): p. 717–27.
OpenUrl CrossRef PubMed Web of Science
43.
Fletcher, O., et al., Novel breast cancer susceptibility locus at 9q31.2: results of a genome-wide association study. J Natl Cancer Inst, 2011. 103(5): p. 425–35.
OpenUrl CrossRef PubMed Web of Science
44.
Sehrawat, B., et al., Potential novel candidate polymorphisms identified in genome-wide association study for breast cancer susceptibility. Hum Genet, 2011. 130(4): p. 529–37.
OpenUrl CrossRef PubMed
45.
Rinella, E.S., et al., Genetic variants associated with breast cancer risk for Ashkenazi Jewish women with strong family histories but no identifiable BRCA1/2 mutation. Hum Genet, 2013. 132(5): p. 523–36.
OpenUrl
46.
Michailidou, K., et al., Large-scale genotyping identifies 41 new loci associated with breast cancer risk. Nat Genet, 2013. 45(4): p. 353–61, 361e1-2.
OpenUrl CrossRef PubMed
47.
Garcia-Closas, M., et al., Genome-wide association studies identify four ER negative-specific breast cancer risk loci. Nat Genet, 2013. 45(4): p. 392–8, 398e1-2.
OpenUrl CrossRef PubMed
48.
Gaudet, M.M., et al., Identification of a BRCA2-specific modifier locus at 6p24 related to breast cancer risk. PLoS Genet, 2013. 9(3): p. e1003173.
OpenUrl CrossRef PubMed
49.
Couch, F.J., et al., Genome-wide association study in BRCA1 mutation carriers identifies novel loci associated with breast and ovarian cancer risk. PLoS Genet, 2013. 9(3): p. e1003212.
OpenUrl CrossRef PubMed
50.
Michailidou, K., et al., Genome-wide association analysis of more than 120,000 individuals identifies 15 new susceptibility loci for breast cancer. Nat Genet, 2015. 47(4): p. 373–80.
OpenUrl CrossRef PubMed
51.↵
Milne, R.L., et al., Identification of ten variants associated with risk of estrogen-receptor-negative breast cancer. Nat Genet, 2017. 49(12): p. 1767–1778.
OpenUrl
52.↵
Euesden, J., C.M. Lewis, and P.F. O’Reilly, PRSice: Polygenic Risk Score software. Bioinformatics, 2015. 31(9): p. 1466–8.
OpenUrl CrossRef PubMed
53.↵
Beesley, L., et al., The Emerging Landscape of Epidemiological Research Based on Biobanks Linked to Electronic Health Records: Existing Resources, Analytic Challenges and Potential Opportunities. Preprints, 2018. 2018090388.
54.↵
Orsted, D.D., et al., Association of clinical benign prostate hyperplasia with prostate cancer incidence and mortality revisited: a nationwide cohort study of 3,009,258 men. Eur Urol, 2011. 60(4): p. 691–8.
OpenUrl CrossRef PubMed
55.↵
Dai, X., et al., Benign Prostatic Hyperplasia and the Risk of Prostate Cancer and Bladder Cancer: A Meta-Analysis of Observational Studies. Medicine (Baltimore), 2016. 95(18): p. e3493.
OpenUrl
56.↵
Zhang, Y., et al., Assessment of Polygenic Architecture and Risk Prediction based on Common Variants Across Fourteen Cancers. bioRxiv, 2019: p. 723825.
57.↵
Zhou, W., et al. Global Biobank Meta-analysis Initiative: Powering genetic discovery across human diseases. in 69th Annual Meeting of the American Society of Human Genetics 2019. Houston, Tx, USA.
58.↵
McCarthy, S., et al., A reference panel of 64,976 haplotypes for genotype imputation. Nat Genet, 2016. 48(10): p. 1279–83.
OpenUrl CrossRef PubMed
59.↵
Wei, W.Q., et al., Evaluating phecodes, clinical classification software, and ICD-9-CM codes for phenome-wide association studies in the electronic health record. PLoS One, 2017. 12(7): p. e0175508.
OpenUrl CrossRef PubMed
60.↵
Flint, E. and S. Cummins, Active commuting and obesity in mid-life: cross-sectional, observational evidence from UK Biobank. Lancet Diabetes Endocrinol, 2016. 4(5): p. 420–35.
OpenUrl
61.↵
Lindstrom, S., et al., Quantifying the Genetic Correlation between Multiple Cancer Types. Cancer Epidemiol Biomarkers Prev, 2017. 26(9): p. 1427–1435.
OpenUrl Abstract/FREE Full Text
62.↵
Torkamani, A., N.E. Wineinger, and E.J. Topol, The personal and clinical utility of polygenic risk scores. Nat Rev Genet, 2018. 19(9): p. 581–590.
OpenUrl CrossRef PubMed
63.↵
Sirugo, G., S.M. Williams, and S.A. Tishkoff, The Missing Diversity in Human Genetic Studies. Cell, 2019. 177(1): p. 26–31.
OpenUrl
64.↵
Das, S., et al., Next-generation genotype imputation service and methods. Nat Genet, 2016. 48(10): p. 1284–1287.
OpenUrl CrossRef PubMed
65.↵
Maas, P., et al., Breast Cancer Risk From Modifiable and Nonmodifiable Risk Factors Among White Women in the United States. JAMA Oncol, 2016. 2(10): p. 1295–1302.
OpenUrl
66.
Garcia-Closas, M., N.B. Gunsoy, and N. Chatterjee, Combined associations of genetic and environmental risk factors: implications for prevention of breast cancer. J Natl Cancer Inst, 2014. 106(11).
67.↵
Das, J.K., et al., Ranking and clustering of Drosophila olfactory receptors using mathematical morphology. Genomics, 2019. 111(4): p. 549–559.
OpenUrl
68.↵
Fritsche, L.G., et al., Association of Polygenic Risk Scores for Multiple Cancers in a Phenome-wide Study: Results from The Michigan Genomics Initiative. Am J Hum Genet, 2018. 102(6): p. 1048–1061.
OpenUrl
69.↵
World Medical Association, World Medical Association Declaration of Helsinki: ethical principles for medical research involving human subjects. JAMA, 2013. 310(20): p. 2191–4.
OpenUrl CrossRef PubMed Web of Science
70.↵
Wang, C., et al., Ancestry estimation and control of population stratification for sequence-based association studies. Nat Genet, 2014. 46(4): p. 409–15.
OpenUrl CrossRef PubMed
71.↵
Li, J.Z., et al., Worldwide human relationships inferred from genome-wide patterns of variation. Science, 2008. 319(5866): p. 1100–4.
OpenUrl Abstract/FREE Full Text
72.↵
Manichaikul, A., et al., Robust relationship inference in genome-wide association studies. Bioinformatics, 2010. 26(22): p. 2867–73.
OpenUrl CrossRef PubMed Web of Science
73.↵
Abraham, K.J. and C. Diaz, Identifying large sets of unrelated individuals and unrelated markers. Source Code Biol Med, 2014. 9(1): p. 6.
OpenUrl
74.↵
Bycroft, C., et al., Genome-wide genetic data on ∼500,000 UK Biobank participants. bioRxiv, 2017.
75.↵
Carroll, R.J., L. Bastarache, and J.C. Denny, R PheWAS: data analysis and plotting tools for phenome-wide association studies in the R environment. Bioinformatics, 2014. 30(16): p. 2375–6.
OpenUrl CrossRef PubMed Web of Science
76.↵
Ho, D.E., et al., MatchIt: Nonparametric Preprocessing for Parametric Causal Inference. Journal of Statistical Software, 2011. 42(8): p. 1–28.
OpenUrl
77.↵
Zhou, W., et al., Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat Genet, 2018. 50(9): p. 1335–1341.
OpenUrl CrossRef PubMed
78.↵
MacArthur, J., et al., The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic Acids Res, 2017. 45(D1): p. D896–D901.
OpenUrl CrossRef PubMed
79.↵
Welter, D., et al., The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res, 2014. 42(Database issue): p. D1001–6.
OpenUrl CrossRef PubMed Web of Science
80.↵
Winkler, T.W., et al., Quality control and conduct of genome-wide association meta-analyses. Nat Protoc, 2014. 9(5): p. 1192–212.
OpenUrl CrossRef PubMed
81.↵
Ransohoff, K.J., et al., Two-stage genome-wide association study identifies a novel susceptibility locus associated with melanoma. Oncotarget, 2017. 8(11): p. 17586–17592.
OpenUrl
82.
Chahal, H.S., et al., Genome-wide association study identifies 14 novel risk alleles associated with basal cell carcinoma. Nat Commun, 2016. 7: p. 12510.
OpenUrl CrossRef
83.↵
Chahal, H.S., et al., Genome-wide association study identifies novel susceptibility loci for cutaneous squamous cell carcinoma. Nat Commun, 2016. 7: p. 12048.
OpenUrl
84.↵
Kuhn, M., et al., caret: Classification and Regression Training. 2018.
85.↵
Mangiafico, S., rcompanion: Functions to Support Extension Education Program Evaluation. 2019.
86.↵
Signorell, A., DescTools: Tools for Descriptive Statistics. 2018.
87.↵
Robin, X., et al., pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 2011. 12: p. 77.
OpenUrl CrossRef PubMed
88.↵
Hosmer, D.W. and S. Lemeshow, Applied Logistic Regression. 2010, New York, USA: John Wiley and Sons.
89.
Lele, S., R., J.L. Keim, and P. Solymos, ResourceSelection: Resource Selection (Probability) Functions for Use-Availability Data. 2017.
90.↵
Steyerberg, E.W., et al., Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology, 2010. 21(1): p. 128–38.
OpenUrl CrossRef PubMed Web of Science
91.↵
Heinze, G., A comparative investigation of methods for logistic regression with separated or nearly separated data. Stat Med, 2006. 25(24): p. 4216–26.
OpenUrl CrossRef PubMed Web of Science
92.
Heinze, G., et al., logistf: Firth’s bias reduced logistic regression. 2013.
93.↵
Choi, L. and C. Beck, EHR: Electronic Health Record (EHR) Data Processing and Analysis Tool. 2017.
94.↵
R Core Team, R: A Language and Environment for Statistical Computing. 2016, R Foundation for Statistical Computing, Vienna, Austria.
95.
Asgari, M.M., et al., Identification of Susceptibility Loci for Cutaneous Squamous Cell Carcinoma. J Invest Dermatol, 2016. 136(5): p. 930–937.
OpenUrl PubMed
96.
Siiskonen, S.J., et al., A Genome-Wide Association Study of Cutaneous Squamous Cell Carcinoma among European Descendants. Cancer Epidemiol Biomarkers Prev, 2016. 25(4): p. 714–20.
OpenUrl Abstract/FREE Full Text
97.
Kiemeney, L.A., et al., Sequence variant on 8q24 confers susceptibility to urinary bladder cancer. Nat Genet, 2008. 40(11): p. 1307–12.
OpenUrl CrossRef PubMed Web of Science
98.
Wu, X., et al., Genetic variation in the prostate stem cell antigen gene PSCA confers susceptibility to urinary bladder cancer. Nat Genet, 2009. 41(9): p. 991–5.
OpenUrl CrossRef PubMed Web of Science
99.
Kiemeney, L.A., et al., A sequence variant at 4p16.3 confers susceptibility to urinary bladder cancer. Nat Genet, 2010. 42(5): p. 415–9.
OpenUrl CrossRef PubMed Web of Science
100.
Rothman, N., et al., A multi-stage genome-wide association study of bladder cancer identifies multiple susceptibility loci. Nat Genet, 2010. 42(11): p. 978–84.
OpenUrl CrossRef PubMed Web of Science
101.
Rafnar, T., et al., European genome-wide association study identifies SLC14A1 as a new urinary bladder cancer susceptibility gene. Hum Mol Genet, 2011. 20(21): p. 4268–81.
OpenUrl CrossRef PubMed Web of Science
102.
Figueroa, J.D., et al., Genome-wide association study identifies multiple loci associated with bladder cancer risk. Hum Mol Genet, 2014. 23(5): p. 1387–98.
OpenUrl CrossRef PubMed Web of Science
103.
Rafnar, T., et al., Genome-wide association study yields variants at 20p12.2 that associate with urinary bladder cancer. Hum Mol Genet, 2014. 23(20): p. 5545–57.
OpenUrl CrossRef PubMed
104.
Landi, M.T., et al., A genome-wide association study of lung cancer identifies a region of chromosome 5p15 associated with risk for adenocarcinoma. Am J Hum Genet, 2009. 85(5): p. 679–91.
OpenUrl CrossRef PubMed Web of Science
105.
Wang, Y., et al., Rare variants of large effect in BRCA2 and CHEK2 affect risk of lung cancer. Nat Genet, 2014. 46(7): p. 736–41.
OpenUrl CrossRef PubMed
106.
McKay, J.D., et al., Large-scale association analysis identifies new lung cancer susceptibility loci and heterogeneity in genetic susceptibility across histological subtypes. Nat Genet, 2017. 49(7): p. 1126–1132.
OpenUrl
107.
Byun, J., et al., Genome-wide association study of familial lung cancer. Carcinogenesis, 2018. 39(9): p. 1135–1140.
OpenUrl
108.
Gudmundsson, J., et al., Common variants on 9q22.33 and 14q13.3 predispose to thyroid cancer in European populations. Nat Genet, 2009. 41(4): p. 460–4.
OpenUrl CrossRef PubMed Web of Science
109.
Gudmundsson, J., et al., Discovery of common variants associated with low TSH levels and thyroid cancer risk. Nat Genet, 2012. 44(3): p. 319–22.
OpenUrl CrossRef PubMed
110.
Kohler, A., et al., Genome-wide association study on differentiated thyroid cancer. J Clin Endocrinol Metab, 2013. 98(10): p. E1674–81.
OpenUrl CrossRef PubMed Web of Science
111.
Mancikova, V., et al., Thyroid cancer GWAS identifies 10q26.12 and 6q14.1 as novel susceptibility loci and reveals genetic heterogeneity among populations. Int J Cancer, 2015. 137(8): p. 1870–8.
OpenUrl CrossRef PubMed
112.
Gudmundsson, J., et al., A genome-wide association study yields five novel thyroid cancer risk loci. Nat Commun, 2017. 8: p. 14517.
OpenUrl CrossRef
113.
Spurdle, A.B., et al., Genome-wide association study identifies a common variant associated with risk of endometrial cancer. Nat Genet, 2011. 43(5): p. 451–4.
OpenUrl CrossRef PubMed
114.
Cheng, T.H., et al., Five endometrial cancer risk loci identified through genome-wide association analysis. Nat Genet, 2016. 48(6): p. 667–674.
OpenUrl CrossRef PubMed
115.
O’Mara, T.A., et al., Identification of nine new susceptibility loci for endometrial cancer. Nat Commun, 2018. 9(1): p. 3166.
OpenUrl CrossRef
116.
Brown, K.M., et al., Common sequence variants on 20q11.22 confer melanoma susceptibility. Nat Genet, 2008. 40(7): p. 838–40.
OpenUrl CrossRef PubMed Web of Science
117.
Bishop, D.T., et al., Genome-wide association study identifies three loci associated with melanoma risk. Nat Genet, 2009. 41(8): p. 920–5.
OpenUrl CrossRef PubMed Web of Science
118.
Macgregor, S., et al., Genome-wide association study identifies a new melanoma susceptibility locus at 1q21.3. Nat Genet, 2011. 43(11): p. 1114–8.
OpenUrl CrossRef PubMed
119.
Barrett, J.H., et al., Genome-wide association study identifies three new melanoma susceptibility loci. Nat Genet, 2011. 43(11): p. 1108–13.
OpenUrl CrossRef PubMed
120.
Iles, M.M., et al., A variant in FTO shows association with melanoma risk not due to BMI. Nat Genet, 2013. 45(4): p. 428–32, 432e1.
OpenUrl CrossRef PubMed
121.
Law, M.H., et al., Genome-wide meta-analysis identifies five new susceptibility loci for cutaneous malignant melanoma. Nat Genet, 2015. 47(9): p. 987–995.
OpenUrl CrossRef PubMed
122.
Vijai, J., et al., A genome-wide association study of marginal zone lymphoma shows association to the HLA region. Nat Commun, 2015. 6: p. 5751.
OpenUrl CrossRef PubMed
123.
McMaster, M.L., et al., Two high-risk susceptibility loci at 6p25.3 and 14q32.13 for Waldenstrom macroglobulinemia. Nat Commun, 2018. 9(1): p. 4182.
OpenUrl
124.
Skibola, C.F., et al., Genetic variants at 6p21.33 are associated with susceptibility to follicular lymphoma. Nat Genet, 2009. 41(8): p. 873–5.
OpenUrl CrossRef PubMed Web of Science
125.
Conde, L., et al., Genome-wide association study of follicular lymphoma identifies a risk locus at 6p21.32. Nat Genet, 2010. 42(8): p. 661–4.
OpenUrl CrossRef PubMed Web of Science
126.
Smedby, K.E., et al., GWAS of follicular lymphoma reveals allelic heterogeneity at 6p21.32 and suggests shared genetic susceptibility with diffuse large B-cell lymphoma. PLoS Genet, 2011. 7(4): p. e1001378.
OpenUrl CrossRef PubMed
127.
Cerhan, J.R., et al., Genome-wide association study identifies multiple susceptibility loci for diffuse large B cell lymphoma. Nat Genet, 2014. 46(11): p. 1233–8.
OpenUrl CrossRef PubMed
128.
Skibola, C.F., et al., Genome-wide association study identifies five susceptibility loci for follicular lymphoma outside the HLA region. Am J Hum Genet, 2014. 95(4): p. 462–71.
OpenUrl CrossRef PubMed
129.
Turnbull, C., et al., A genome-wide association study identifies susceptibility loci for Wilms tumor. Nat Genet, 2012. 44(6): p. 681–4.
OpenUrl CrossRef PubMed
130.
Scelo, G., et al., Genome-wide association study identifies multiple risk loci for renal cell carcinoma. Nat Commun, 2017. 8: p. 15724.
OpenUrl CrossRef
131.
Amundadottir, L., et al., Genome-wide association study identifies variants in the ABO locus associated with susceptibility to pancreatic cancer. Nat Genet, 2009. 41(9): p. 986–90.
OpenUrl CrossRef PubMed Web of Science
132.
Wu, C., et al., Genome-wide association study of survival in patients with pancreatic adenocarcinoma. Gut, 2014. 63(1): p. 152–60.
OpenUrl Abstract/FREE Full Text
133.
Wolpin, B.M., et al., Genome-wide association study identifies multiple susceptibility loci for pancreatic cancer. Nat Genet, 2014. 46(9): p. 994–1000.
OpenUrl CrossRef PubMed
134.
Zhang, M., et al., Three new pancreatic cancer susceptibility signals identified on chromosomes 1q32.1, 5p15.33 and 8q24.21. Oncotarget, 2016. 7(41): p. 66328–66343.
OpenUrl
135.
Klein, A.P., et al., Genome-wide meta-analysis identifies five new susceptibility loci for pancreatic cancer. Nat Commun, 2018. 9(1): p. 556.
OpenUrl
136.
Wrensch, M., et al., Variants in the CDKN2B and RTEL1 regions are associated with high-grade glioma susceptibility. Nat Genet, 2009. 41(8): p. 905–8.
OpenUrl CrossRef PubMed Web of Science
137.
Shete, S., et al., Genome-wide association study identifies five susceptibility loci for glioma. Nat Genet, 2009. 41(8): p. 899–904.
OpenUrl CrossRef PubMed Web of Science
138.
Sanson, M., et al., Chromosome 7p11.2 (EGFR) variation influences glioma risk. Hum Mol Genet, 2011. 20(14): p. 2897–904.
OpenUrl CrossRef PubMed Web of Science
139.
Rajaraman, P., et al., Genome-wide association study of glioma and meta-analysis. Hum Genet, 2012. 131(12): p. 1877–88.
OpenUrl CrossRef PubMed
140.
Walsh, K.M., et al., Variants near TERT and TERC influencing telomere length are associated with high-grade glioma risk. Nat Genet, 2014. 46(7): p. 731–5.
OpenUrl CrossRef PubMed
141.
Kinnersley, B., et al., Genome-wide association study identifies multiple susceptibility loci for glioma. Nat Commun, 2015. 6: p. 8559.
OpenUrl CrossRef PubMed
142.
Melin, B.S., et al., Genome-wide association study of glioma subtypes identifies specific differences in genetic susceptibility to glioblastoma and non-glioblastoma tumors. Nat Genet, 2017. 49(5): p. 789–794.
OpenUrl CrossRef PubMed
143.
Broderick, P., et al., Common variation at 3p22.1 and 7p15.3 influences multiple myeloma risk. Nat Genet, 2011. 44(1): p. 58–61.
OpenUrl CrossRef PubMed
144.
Weinhold, N., et al., The CCND1 c.870G>A polymorphism is a risk factor for t(11;14)(q13;q32) multiple myeloma. Nat Genet, 2013. 45(5): p. 522–525.
OpenUrl CrossRef PubMed
145.
Chubb, D., et al., Common variation at 3q26.2, 6p21.33, 17p11.2 and 22q13.1 influences multiple myeloma risk. Nat Genet, 2013. 45(10): p. 1221–1225.
OpenUrl CrossRef PubMed
146.
Swaminathan, B., et al., Variants in ELL2 influencing immunoglobulin levels associate with multiple myeloma. Nat Commun, 2015. 6: p. 7213.
OpenUrl CrossRef PubMed
147.
Mitchell, J.S., et al., Genome-wide association study identifies multiple susceptibility loci for multiple myeloma. Nat Commun, 2016. 7: p. 12050.
OpenUrl CrossRef
148.
Went, M., et al., Identification of multiple risk loci and regulatory mechanisms influencing susceptibility to multiple myeloma. Nat Commun, 2018. 9(1): p. 3707.
OpenUrl
149.
Xiao, Y., et al., SSBP2 variants are associated with survival in glioblastoma patients. Clin Cancer Res, 2012. 18(11): p. 3154–62.
OpenUrl Abstract/FREE Full Text
150.
Di Bernardo, M.C., et al., A genome-wide association study identifies six susceptibility loci for chronic lymphocytic leukemia. Nat Genet, 2008. 40(10): p. 1204–10.
OpenUrl CrossRef PubMed Web of Science
151.
Slager, S.L., et al., Genome-wide association study identifies a novel susceptibility locus at 6p21.3 among familial CLL. Blood, 2011. 117(6): p. 1911–6.
OpenUrl Abstract/FREE Full Text
152.
Slager, S.L., et al., Common variation at 6p21.31 (BAK1) influences the risk of chronic lymphocytic leukemia. Blood, 2012. 120(4): p. 843–6.
OpenUrl Abstract/FREE Full Text
153.
Berndt, S.I., et al., Genome-wide association study identifies multiple risk loci for chronic lymphocytic leukemia. Nat Genet, 2013. 45(8): p. 868–76.
OpenUrl CrossRef PubMed
154.
Speedy, H.E., et al., A genome-wide association study identifies multiple susceptibility loci for chronic lymphocytic leukemia. Nat Genet, 2014. 46(1): p. 56–60.
OpenUrl CrossRef PubMed
155.
Berndt, S.I., et al., Meta-analysis of genome-wide association studies discovers multiple loci for chronic lymphocytic leukemia. Nat Commun, 2016. 7: p. 10933.
OpenUrl CrossRef PubMed
156.
Law, P.J., et al., Genome-wide association analysis implicates dysregulation of immunity genes in chronic lymphocytic leukaemia. Nat Commun, 2017. 8: p. 14175.
OpenUrl
157.
Rapley, E.A., et al., A genome-wide association study of testicular germ cell tumor. Nat Genet, 2009. 41(7): p. 807–10.
OpenUrl CrossRef PubMed Web of Science
158.
Turnbull, C., et al., Variants near DMRT1, TERT and ATF7IP are associated with testicular germ cell cancer. Nat Genet, 2010. 42(7): p. 604–7.
OpenUrl CrossRef PubMed Web of Science
159.
Chung, C.C., et al., Meta-analysis identifies four new loci associated with testicular germ cell tumor. Nat Genet, 2013. 45(6): p. 680–5.
OpenUrl CrossRef PubMed
160.
Ruark, E., et al., Identification of nine new susceptibility loci for testicular cancer, including variants near DAZL and PRDM14. Nat Genet, 2013. 45(6): p. 686–9.
OpenUrl CrossRef PubMed
161.
Kristiansen, W., et al., Two new loci and gene sets related to sex determination and cancer progression are associated with susceptibility to testicular germ cell tumor. Hum Mol Genet, 2015. 24(14): p. 4138–46.
OpenUrl CrossRef PubMed
162.
Litchfield, K., et al., Identification of four new susceptibility loci for testicular germ cell tumour. Nat Commun, 2015. 6: p. 8690.
OpenUrl
163.
Litchfield, K., et al., Identification of 19 new risk loci and potential regulatory mechanisms influencing susceptibility to testicular germ cell tumor. Nat Genet, 2017. 49(7): p. 1133–1140.
OpenUrl CrossRef
164.
Wang, Z., et al., Meta-analysis of five genome-wide association studies identifies multiple new loci associated with testicular germ cell tumor. Nat Genet, 2017. 49(7): p. 1141–1147.
OpenUrl CrossRef
165.
Enciso-Mora, V., et al., A genome-wide association study of Hodgkin’s lymphoma identifies new susceptibility loci at 2p16.1 (REL), 8q24.21 and 10p14 (GATA3). Nat Genet, 2010. 42(12): p. 1126–1130.
OpenUrl CrossRef PubMed Web of Science
166.
Frampton, M., et al., Variation at 3p24.1 and 6q23.3 influences the risk of Hodgkin’s lymphoma. Nat Commun, 2013. 4: p. 2549.
OpenUrl PubMed
167.
Cozen, W., et al., A meta-analysis of Hodgkin lymphoma reveals 19p13.3 TCF3 as a novel susceptibility locus. Nat Commun, 2014. 5: p. 3856.
OpenUrl PubMed
168.
Sud, A., et al., Genome-wide association study of classical Hodgkin lymphoma identifies key regulators of disease susceptibility. Nat Commun, 2017. 8(1): p. 1892.
OpenUrl
169.
Sud, A., et al., Genome-wide association study implicates immune dysfunction in the development of Hodgkin lymphoma. Blood, 2018. 132(19): p. 2040–2052.
OpenUrl CrossRef

View the discussion thread.

Posted January 24, 2020.

Download PDF

Supplementary Material

Data/Code

Citation Tools

Subject Area

Genetics

Subject Areas

All Articles

Animal Behavior and Cognition (5215)
Biochemistry (11752)
Bioengineering (8752)
Bioinformatics (29200)
Biophysics (14974)
Cancer Biology (12096)
Cell Biology (17411)
Clinical Trials (138)
Developmental Biology (9421)
Ecology (14182)
Epidemiology (2067)
Evolutionary Biology (18308)
Genetics (12245)
Genomics (16803)
Immunology (11869)
Microbiology (28097)
Molecular Biology (11594)
Neuroscience (60969)
Paleontology (451)
Pathology (1871)
Pharmacology and Toxicology (3238)
Physiology (4959)
Plant Biology (10427)
Scientific Communication and Education (1683)
Synthetic Biology (2886)
Systems Biology (7340)
Zoology (1651)

[1] 1.↵
Klein, R.J., et al., Complement factor H polymorphism in age-related macular degeneration. Science, 2005. 308(5720): p. 385–9.
OpenUrl Abstract/FREE Full Text

[2] 2.↵
Buniello, A., et al., The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res, 2019. 47(D1): p. D1005–D1012.
OpenUrl CrossRef PubMed

[3] 3.↵
Visscher, P.M., et al., 10 Years of GWAS Discovery: Biology, Function, and Translation. Am J Hum Genet, 2017. 101(1): p. 5–22.
OpenUrl CrossRef PubMed

[4] 4.↵
Khera, A.V., et al., Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat Genet, 2018. 50(9): p. 1219–1224.
OpenUrl CrossRef PubMed

[5] 5.
Dudbridge, F., Power and predictive accuracy of polygenic risk scores. PLoS Genet, 2013. 9(3): p. e1003348.
OpenUrl CrossRef PubMed

[6] 6.↵
Chatterjee, N., et al., Projecting the performance of risk prediction based on polygenic analyses of genome-wide association studies. Nat Genet, 2013. 45(4): p. 400–5, 405e1-3.
OpenUrl CrossRef PubMed

[7] 7.↵
Martin, A.R., et al., Human Demographic History Impacts Genetic Risk Prediction across Diverse Populations. Am J Hum Genet, 2017. 100(4): p. 635–649.
OpenUrl CrossRef PubMed

[8] 8.↵
Lawrenson, K., et al., Genome-wide association studies identify susceptibility loci for epithelial ovarian cancer in east Asian women. Gynecol Oncol, 2019. 153(2): p. 343–355.
OpenUrl

[9] 9.↵
Phelan, C.M., et al., Identification of 12 new susceptibility loci for different histotypes of epithelial ovarian cancer. Nat Genet, 2017. 49(5): p. 680–691.
OpenUrl CrossRef

[10] 10.↵
Lee, J.Y., et al., BRCA1/2-negative, high-risk breast cancers (BRCAX) for Asian women: genetic susceptibility loci and their potential impacts. Sci Rep, 2018. 8(1): p. 15263.
OpenUrl

[11] 11.↵
Michailidou, K., et al., Association analysis identifies 65 new breast cancer risk loci. Nature, 2017. 551(7678): p. 92–94.
OpenUrl

[12] 12.↵
Schumacher, F.R., et al., Association analyses of more than 140,000 men identify 63 new prostate cancer susceptibility loci. Nat Genet, 2018. 50(7): p. 928–936.
OpenUrl

[13] 13.↵
Tanikawa, C., et al., GWAS identifies two novel colorectal cancer loci at 16q24.1 and 20q13.12. Carcinogenesis, 2018. 39(5): p. 652–660.
OpenUrl

[14] 14.↵
Leo, P.J., et al., Defining the genetic susceptibility to cervical neoplasia-A genome-wide association study. PLoS Genet, 2017. 13(8): p. e1006866.
OpenUrl CrossRef PubMed

[15] 15.↵
Huyghe, J.R., et al., Discovery of common and rare genetic risk variants for colorectal cancer. Nat Genet, 2019. 51(1): p. 76–87.
OpenUrl CrossRef

[16] 16.↵
Sudlow, C., et al., UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med, 2015. 12(3): p. e1001779.
OpenUrl CrossRef PubMed

[17] 17.↵
Zhou, W., et al., Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nature Genetics, 2018. 50(9): p. 1335–1341.
OpenUrl CrossRef PubMed

[18] 18.↵
Millard, L.A.C., et al., Software Application Profile: PHESANT: a tool for performing automated phenome scans in UK Biobank. Int J Epidemiol, 2017.

[19] 19.↵
Wu, P., et al., Developing and Evaluating Mappings of ICD-10 and ICD-10-CM Codes to PheCodes. bioRxiv, 2019: p. 462077.

[20] 20.↵
Shi, X., H. Pashova, and P.J. Heagerty, Comparing healthcare utilization patterns via global differences in the endorsement of current procedural terminology codes. Ann. Appl. Stat., 2017. 11(3): p. 1349–1374.
OpenUrl

[21] 21.↵
Vilhjalmsson, B.J., et al., Modeling Linkage Disequilibrium Increases Accuracy of Polygenic Risk Scores. Am J Hum Genet, 2015. 97(4): p. 576–92.
OpenUrl CrossRef PubMed

[22] 22.↵
Mak, T.S.H., et al., Polygenic scores via penalized regression on summary statistics. Genet Epidemiol, 2017. 41(6): p. 469–480.
OpenUrl

[23] 23.
Lloyd-Jones, L.R., et al., Improved polygenic prediction by Bayesian multiple regression on summary statistics. 2019: p. 522961.

[24] 24.↵
Ge, T., et al., Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat Commun, 2019. 10(1): p. 1776.
OpenUrl

[25] 25.↵
Zhou, X., P. Carbonetto, and M. Stephens, Polygenic modeling with bayesian sparse linear mixed models. PLoS Genet, 2013. 9(2): p. e1003264.
OpenUrl CrossRef PubMed

[26] 26.↵
Prive, F., et al., Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr. Bioinformatics, 2018. 34(16): p. 2781–2787.
OpenUrl

[27] 27.↵
Prive, F., H. Aschard, and M.G.B. Blum, Efficient Implementation of Penalized Regression for Genetic Risk Prediction. Genetics, 2019. 212(1): p. 65–74.
OpenUrl Abstract/FREE Full Text

[28] 28.↵
Mavaddat, N., et al., Polygenic Risk Scores for Prediction of Breast Cancer and Breast Cancer Subtypes. Am J Hum Genet, 2019. 104(1): p. 21–34.
OpenUrl CrossRef PubMed

[29] 29.
Du, Z., et al., A genome-wide association study of prostate cancer in Latinos. Int J Cancer, 2019.

[30] 30.↵
Shieh, Y., et al., A polygenic risk score for breast cancer in U.S. Latinas and Latin-American women. J Natl Cancer Inst, 2019.

[31] 31.↵
Fritsche, L.G., et al., Association of Polygenic Risk Scores for Multiple Cancers in a Phenome-wide Study: Results from The Michigan Genomics Initiative. American Journal of Human Genetics, 2018. 102(6): p. 1048–1061.
OpenUrl

[32] 32.↵
Fritsche, L.G., et al., Exploring various polygenic risk scores for skin cancer in the phenomes of the Michigan genomics initiative and the UK Biobank with a visual catalog: PRSWeb. PLoS Genet, 2019. 15(6): p. e1008202.
OpenUrl

[33] 33.↵
Song, M., et al., Testing calibration of risk models at extremes of disease risk. Biostatistics, 2015. 16(1): p. 143–54.
OpenUrl CrossRef PubMed

[34] 34.↵
Easton, D.F., et al., Genome-wide association study identifies novel breast cancer susceptibility loci. Nature, 2007. 447(7148): p. 1087–93.
OpenUrl CrossRef PubMed Web of Science

[35] 35.
Hunter, D.J., et al., A genome-wide association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer. Nat Genet, 2007. 39(7): p. 870–4.
OpenUrl CrossRef PubMed Web of Science

[36] 36.
Stacey, S.N., et al., Common variants on chromosomes 2q35 and 16q12 confer susceptibility to estrogen receptor-positive breast cancer. Nat Genet, 2007. 39(7): p. 865–9.
OpenUrl CrossRef PubMed

[37] 37.
Gold, B., et al., Genome-wide association study provides evidence for a breast cancer risk locus at 6q22.33. Proc Natl Acad Sci U S A, 2008. 105(11): p. 4340–5.
OpenUrl Abstract/FREE Full Text

[38] 38.
Ahmed, S., et al., Newly discovered breast cancer susceptibility loci on 3p24 and 17q23.2. Nat Genet, 2009. 41(5): p. 585–90.
OpenUrl CrossRef PubMed Web of Science

[39] 39.
Thomas, G., et al., A multistage genome-wide association study in breast cancer identifies two new risk alleles at 1p11.2 and 14q24.1 (RAD51L1). Nat Genet, 2009. 41(5): p. 579–84.
OpenUrl CrossRef PubMed Web of Science

[40] 40.
Turnbull, C., et al., Genome-wide association study identifies five new breast cancer susceptibility loci. Nat Genet, 2010. 42(6): p. 504–7.
OpenUrl CrossRef PubMed Web of Science

[41] 41.
Antoniou, A.C., et al., A locus on 19p13 modifies risk of breast cancer in BRCA1 mutation carriers and is associated with hormone receptor-negative breast cancer in the general population. Nat Genet, 2010. 42(10): p. 885–92.
OpenUrl CrossRef PubMed Web of Science

[42] 42.
Li, J., et al., A combined analysis of genome-wide association studies in breast cancer. Breast Cancer Res Treat, 2011. 126(3): p. 717–27.
OpenUrl CrossRef PubMed Web of Science

[43] 43.
Fletcher, O., et al., Novel breast cancer susceptibility locus at 9q31.2: results of a genome-wide association study. J Natl Cancer Inst, 2011. 103(5): p. 425–35.
OpenUrl CrossRef PubMed Web of Science

[44] 44.
Sehrawat, B., et al., Potential novel candidate polymorphisms identified in genome-wide association study for breast cancer susceptibility. Hum Genet, 2011. 130(4): p. 529–37.
OpenUrl CrossRef PubMed

[45] 45.
Rinella, E.S., et al., Genetic variants associated with breast cancer risk for Ashkenazi Jewish women with strong family histories but no identifiable BRCA1/2 mutation. Hum Genet, 2013. 132(5): p. 523–36.
OpenUrl

[46] 46.
Michailidou, K., et al., Large-scale genotyping identifies 41 new loci associated with breast cancer risk. Nat Genet, 2013. 45(4): p. 353–61, 361e1-2.
OpenUrl CrossRef PubMed

[47] 47.
Garcia-Closas, M., et al., Genome-wide association studies identify four ER negative-specific breast cancer risk loci. Nat Genet, 2013. 45(4): p. 392–8, 398e1-2.
OpenUrl CrossRef PubMed

[48] 48.
Gaudet, M.M., et al., Identification of a BRCA2-specific modifier locus at 6p24 related to breast cancer risk. PLoS Genet, 2013. 9(3): p. e1003173.
OpenUrl CrossRef PubMed

[49] 49.
Couch, F.J., et al., Genome-wide association study in BRCA1 mutation carriers identifies novel loci associated with breast and ovarian cancer risk. PLoS Genet, 2013. 9(3): p. e1003212.
OpenUrl CrossRef PubMed

[50] 50.
Michailidou, K., et al., Genome-wide association analysis of more than 120,000 individuals identifies 15 new susceptibility loci for breast cancer. Nat Genet, 2015. 47(4): p. 373–80.
OpenUrl CrossRef PubMed

[51] 51.↵
Milne, R.L., et al., Identification of ten variants associated with risk of estrogen-receptor-negative breast cancer. Nat Genet, 2017. 49(12): p. 1767–1778.
OpenUrl

[52] 52.↵
Euesden, J., C.M. Lewis, and P.F. O’Reilly, PRSice: Polygenic Risk Score software. Bioinformatics, 2015. 31(9): p. 1466–8.
OpenUrl CrossRef PubMed

[53] 53.↵
Beesley, L., et al., The Emerging Landscape of Epidemiological Research Based on Biobanks Linked to Electronic Health Records: Existing Resources, Analytic Challenges and Potential Opportunities. Preprints, 2018. 2018090388.

[54] 54.↵
Orsted, D.D., et al., Association of clinical benign prostate hyperplasia with prostate cancer incidence and mortality revisited: a nationwide cohort study of 3,009,258 men. Eur Urol, 2011. 60(4): p. 691–8.
OpenUrl CrossRef PubMed

[55] 55.↵
Dai, X., et al., Benign Prostatic Hyperplasia and the Risk of Prostate Cancer and Bladder Cancer: A Meta-Analysis of Observational Studies. Medicine (Baltimore), 2016. 95(18): p. e3493.
OpenUrl

[56] 56.↵
Zhang, Y., et al., Assessment of Polygenic Architecture and Risk Prediction based on Common Variants Across Fourteen Cancers. bioRxiv, 2019: p. 723825.

[57] 57.↵
Zhou, W., et al. Global Biobank Meta-analysis Initiative: Powering genetic discovery across human diseases. in 69th Annual Meeting of the American Society of Human Genetics 2019. Houston, Tx, USA.

[58] 58.↵
McCarthy, S., et al., A reference panel of 64,976 haplotypes for genotype imputation. Nat Genet, 2016. 48(10): p. 1279–83.
OpenUrl CrossRef PubMed

[59] 59.↵
Wei, W.Q., et al., Evaluating phecodes, clinical classification software, and ICD-9-CM codes for phenome-wide association studies in the electronic health record. PLoS One, 2017. 12(7): p. e0175508.
OpenUrl CrossRef PubMed

[60] 60.↵
Flint, E. and S. Cummins, Active commuting and obesity in mid-life: cross-sectional, observational evidence from UK Biobank. Lancet Diabetes Endocrinol, 2016. 4(5): p. 420–35.
OpenUrl

[61] 61.↵
Lindstrom, S., et al., Quantifying the Genetic Correlation between Multiple Cancer Types. Cancer Epidemiol Biomarkers Prev, 2017. 26(9): p. 1427–1435.
OpenUrl Abstract/FREE Full Text

[62] 62.↵
Torkamani, A., N.E. Wineinger, and E.J. Topol, The personal and clinical utility of polygenic risk scores. Nat Rev Genet, 2018. 19(9): p. 581–590.
OpenUrl CrossRef PubMed

[63] 63.↵
Sirugo, G., S.M. Williams, and S.A. Tishkoff, The Missing Diversity in Human Genetic Studies. Cell, 2019. 177(1): p. 26–31.
OpenUrl

[64] 64.↵
Das, S., et al., Next-generation genotype imputation service and methods. Nat Genet, 2016. 48(10): p. 1284–1287.
OpenUrl CrossRef PubMed

[65] 65.↵
Maas, P., et al., Breast Cancer Risk From Modifiable and Nonmodifiable Risk Factors Among White Women in the United States. JAMA Oncol, 2016. 2(10): p. 1295–1302.
OpenUrl

[66] 66.
Garcia-Closas, M., N.B. Gunsoy, and N. Chatterjee, Combined associations of genetic and environmental risk factors: implications for prevention of breast cancer. J Natl Cancer Inst, 2014. 106(11).

[67] 67.↵
Das, J.K., et al., Ranking and clustering of Drosophila olfactory receptors using mathematical morphology. Genomics, 2019. 111(4): p. 549–559.
OpenUrl

[68] 68.↵
Fritsche, L.G., et al., Association of Polygenic Risk Scores for Multiple Cancers in a Phenome-wide Study: Results from The Michigan Genomics Initiative. Am J Hum Genet, 2018. 102(6): p. 1048–1061.
OpenUrl

[69] 69.↵
World Medical Association, World Medical Association Declaration of Helsinki: ethical principles for medical research involving human subjects. JAMA, 2013. 310(20): p. 2191–4.
OpenUrl CrossRef PubMed Web of Science

[70] 70.↵
Wang, C., et al., Ancestry estimation and control of population stratification for sequence-based association studies. Nat Genet, 2014. 46(4): p. 409–15.
OpenUrl CrossRef PubMed

[71] 71.↵
Li, J.Z., et al., Worldwide human relationships inferred from genome-wide patterns of variation. Science, 2008. 319(5866): p. 1100–4.
OpenUrl Abstract/FREE Full Text

[72] 72.↵
Manichaikul, A., et al., Robust relationship inference in genome-wide association studies. Bioinformatics, 2010. 26(22): p. 2867–73.
OpenUrl CrossRef PubMed Web of Science

[73] 73.↵
Abraham, K.J. and C. Diaz, Identifying large sets of unrelated individuals and unrelated markers. Source Code Biol Med, 2014. 9(1): p. 6.
OpenUrl

[74] 74.↵
Bycroft, C., et al., Genome-wide genetic data on ∼500,000 UK Biobank participants. bioRxiv, 2017.

[75] 75.↵
Carroll, R.J., L. Bastarache, and J.C. Denny, R PheWAS: data analysis and plotting tools for phenome-wide association studies in the R environment. Bioinformatics, 2014. 30(16): p. 2375–6.
OpenUrl CrossRef PubMed Web of Science

[76] 76.↵
Ho, D.E., et al., MatchIt: Nonparametric Preprocessing for Parametric Causal Inference. Journal of Statistical Software, 2011. 42(8): p. 1–28.
OpenUrl

[77] 77.↵
Zhou, W., et al., Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat Genet, 2018. 50(9): p. 1335–1341.
OpenUrl CrossRef PubMed

[78] 78.↵
MacArthur, J., et al., The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic Acids Res, 2017. 45(D1): p. D896–D901.
OpenUrl CrossRef PubMed

[79] 79.↵
Welter, D., et al., The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res, 2014. 42(Database issue): p. D1001–6.
OpenUrl CrossRef PubMed Web of Science

[80] 80.↵
Winkler, T.W., et al., Quality control and conduct of genome-wide association meta-analyses. Nat Protoc, 2014. 9(5): p. 1192–212.
OpenUrl CrossRef PubMed

[81] 81.↵
Ransohoff, K.J., et al., Two-stage genome-wide association study identifies a novel susceptibility locus associated with melanoma. Oncotarget, 2017. 8(11): p. 17586–17592.
OpenUrl

[82] 82.
Chahal, H.S., et al., Genome-wide association study identifies 14 novel risk alleles associated with basal cell carcinoma. Nat Commun, 2016. 7: p. 12510.
OpenUrl CrossRef

[83] 83.↵
Chahal, H.S., et al., Genome-wide association study identifies novel susceptibility loci for cutaneous squamous cell carcinoma. Nat Commun, 2016. 7: p. 12048.
OpenUrl

[84] 84.↵
Kuhn, M., et al., caret: Classification and Regression Training. 2018.

[85] 85.↵
Mangiafico, S., rcompanion: Functions to Support Extension Education Program Evaluation. 2019.

[86] 86.↵
Signorell, A., DescTools: Tools for Descriptive Statistics. 2018.

[87] 87.↵
Robin, X., et al., pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 2011. 12: p. 77.
OpenUrl CrossRef PubMed

[88] 88.↵
Hosmer, D.W. and S. Lemeshow, Applied Logistic Regression. 2010, New York, USA: John Wiley and Sons.

[89] 89.
Lele, S., R., J.L. Keim, and P. Solymos, ResourceSelection: Resource Selection (Probability) Functions for Use-Availability Data. 2017.

[90] 90.↵
Steyerberg, E.W., et al., Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology, 2010. 21(1): p. 128–38.
OpenUrl CrossRef PubMed Web of Science

[91] 91.↵
Heinze, G., A comparative investigation of methods for logistic regression with separated or nearly separated data. Stat Med, 2006. 25(24): p. 4216–26.
OpenUrl CrossRef PubMed Web of Science

[92] 92.
Heinze, G., et al., logistf: Firth’s bias reduced logistic regression. 2013.

[93] 93.↵
Choi, L. and C. Beck, EHR: Electronic Health Record (EHR) Data Processing and Analysis Tool. 2017.

[94] 94.↵
R Core Team, R: A Language and Environment for Statistical Computing. 2016, R Foundation for Statistical Computing, Vienna, Austria.

[95] 95.
Asgari, M.M., et al., Identification of Susceptibility Loci for Cutaneous Squamous Cell Carcinoma. J Invest Dermatol, 2016. 136(5): p. 930–937.
OpenUrl PubMed

[96] 96.
Siiskonen, S.J., et al., A Genome-Wide Association Study of Cutaneous Squamous Cell Carcinoma among European Descendants. Cancer Epidemiol Biomarkers Prev, 2016. 25(4): p. 714–20.
OpenUrl Abstract/FREE Full Text

[97] 97.
Kiemeney, L.A., et al., Sequence variant on 8q24 confers susceptibility to urinary bladder cancer. Nat Genet, 2008. 40(11): p. 1307–12.
OpenUrl CrossRef PubMed Web of Science

[98] 98.
Wu, X., et al., Genetic variation in the prostate stem cell antigen gene PSCA confers susceptibility to urinary bladder cancer. Nat Genet, 2009. 41(9): p. 991–5.
OpenUrl CrossRef PubMed Web of Science

[99] 99.
Kiemeney, L.A., et al., A sequence variant at 4p16.3 confers susceptibility to urinary bladder cancer. Nat Genet, 2010. 42(5): p. 415–9.
OpenUrl CrossRef PubMed Web of Science

[100] 100.
Rothman, N., et al., A multi-stage genome-wide association study of bladder cancer identifies multiple susceptibility loci. Nat Genet, 2010. 42(11): p. 978–84.
OpenUrl CrossRef PubMed Web of Science

[101] 101.
Rafnar, T., et al., European genome-wide association study identifies SLC14A1 as a new urinary bladder cancer susceptibility gene. Hum Mol Genet, 2011. 20(21): p. 4268–81.
OpenUrl CrossRef PubMed Web of Science

[102] 102.
Figueroa, J.D., et al., Genome-wide association study identifies multiple loci associated with bladder cancer risk. Hum Mol Genet, 2014. 23(5): p. 1387–98.
OpenUrl CrossRef PubMed Web of Science

[103] 103.
Rafnar, T., et al., Genome-wide association study yields variants at 20p12.2 that associate with urinary bladder cancer. Hum Mol Genet, 2014. 23(20): p. 5545–57.
OpenUrl CrossRef PubMed

[104] 104.
Landi, M.T., et al., A genome-wide association study of lung cancer identifies a region of chromosome 5p15 associated with risk for adenocarcinoma. Am J Hum Genet, 2009. 85(5): p. 679–91.
OpenUrl CrossRef PubMed Web of Science

[105] 105.
Wang, Y., et al., Rare variants of large effect in BRCA2 and CHEK2 affect risk of lung cancer. Nat Genet, 2014. 46(7): p. 736–41.
OpenUrl CrossRef PubMed

[106] 106.
McKay, J.D., et al., Large-scale association analysis identifies new lung cancer susceptibility loci and heterogeneity in genetic susceptibility across histological subtypes. Nat Genet, 2017. 49(7): p. 1126–1132.
OpenUrl

[107] 107.
Byun, J., et al., Genome-wide association study of familial lung cancer. Carcinogenesis, 2018. 39(9): p. 1135–1140.
OpenUrl

[108] 108.
Gudmundsson, J., et al., Common variants on 9q22.33 and 14q13.3 predispose to thyroid cancer in European populations. Nat Genet, 2009. 41(4): p. 460–4.
OpenUrl CrossRef PubMed Web of Science

[109] 109.
Gudmundsson, J., et al., Discovery of common variants associated with low TSH levels and thyroid cancer risk. Nat Genet, 2012. 44(3): p. 319–22.
OpenUrl CrossRef PubMed

[110] 110.
Kohler, A., et al., Genome-wide association study on differentiated thyroid cancer. J Clin Endocrinol Metab, 2013. 98(10): p. E1674–81.
OpenUrl CrossRef PubMed Web of Science

[111] 111.
Mancikova, V., et al., Thyroid cancer GWAS identifies 10q26.12 and 6q14.1 as novel susceptibility loci and reveals genetic heterogeneity among populations. Int J Cancer, 2015. 137(8): p. 1870–8.
OpenUrl CrossRef PubMed

[112] 112.
Gudmundsson, J., et al., A genome-wide association study yields five novel thyroid cancer risk loci. Nat Commun, 2017. 8: p. 14517.
OpenUrl CrossRef

[113] 113.
Spurdle, A.B., et al., Genome-wide association study identifies a common variant associated with risk of endometrial cancer. Nat Genet, 2011. 43(5): p. 451–4.
OpenUrl CrossRef PubMed

[114] 114.
Cheng, T.H., et al., Five endometrial cancer risk loci identified through genome-wide association analysis. Nat Genet, 2016. 48(6): p. 667–674.
OpenUrl CrossRef PubMed

[115] 115.
O’Mara, T.A., et al., Identification of nine new susceptibility loci for endometrial cancer. Nat Commun, 2018. 9(1): p. 3166.
OpenUrl CrossRef

[116] 116.
Brown, K.M., et al., Common sequence variants on 20q11.22 confer melanoma susceptibility. Nat Genet, 2008. 40(7): p. 838–40.
OpenUrl CrossRef PubMed Web of Science

[117] 117.
Bishop, D.T., et al., Genome-wide association study identifies three loci associated with melanoma risk. Nat Genet, 2009. 41(8): p. 920–5.
OpenUrl CrossRef PubMed Web of Science

[118] 118.
Macgregor, S., et al., Genome-wide association study identifies a new melanoma susceptibility locus at 1q21.3. Nat Genet, 2011. 43(11): p. 1114–8.
OpenUrl CrossRef PubMed

[119] 119.
Barrett, J.H., et al., Genome-wide association study identifies three new melanoma susceptibility loci. Nat Genet, 2011. 43(11): p. 1108–13.
OpenUrl CrossRef PubMed

[120] 120.
Iles, M.M., et al., A variant in FTO shows association with melanoma risk not due to BMI. Nat Genet, 2013. 45(4): p. 428–32, 432e1.
OpenUrl CrossRef PubMed

[121] 121.
Law, M.H., et al., Genome-wide meta-analysis identifies five new susceptibility loci for cutaneous malignant melanoma. Nat Genet, 2015. 47(9): p. 987–995.
OpenUrl CrossRef PubMed

[122] 122.
Vijai, J., et al., A genome-wide association study of marginal zone lymphoma shows association to the HLA region. Nat Commun, 2015. 6: p. 5751.
OpenUrl CrossRef PubMed

[123] 123.
McMaster, M.L., et al., Two high-risk susceptibility loci at 6p25.3 and 14q32.13 for Waldenstrom macroglobulinemia. Nat Commun, 2018. 9(1): p. 4182.
OpenUrl

[124] 124.
Skibola, C.F., et al., Genetic variants at 6p21.33 are associated with susceptibility to follicular lymphoma. Nat Genet, 2009. 41(8): p. 873–5.
OpenUrl CrossRef PubMed Web of Science

[125] 125.
Conde, L., et al., Genome-wide association study of follicular lymphoma identifies a risk locus at 6p21.32. Nat Genet, 2010. 42(8): p. 661–4.
OpenUrl CrossRef PubMed Web of Science

[126] 126.
Smedby, K.E., et al., GWAS of follicular lymphoma reveals allelic heterogeneity at 6p21.32 and suggests shared genetic susceptibility with diffuse large B-cell lymphoma. PLoS Genet, 2011. 7(4): p. e1001378.
OpenUrl CrossRef PubMed

[127] 127.
Cerhan, J.R., et al., Genome-wide association study identifies multiple susceptibility loci for diffuse large B cell lymphoma. Nat Genet, 2014. 46(11): p. 1233–8.
OpenUrl CrossRef PubMed

[128] 128.
Skibola, C.F., et al., Genome-wide association study identifies five susceptibility loci for follicular lymphoma outside the HLA region. Am J Hum Genet, 2014. 95(4): p. 462–71.
OpenUrl CrossRef PubMed

[129] 129.
Turnbull, C., et al., A genome-wide association study identifies susceptibility loci for Wilms tumor. Nat Genet, 2012. 44(6): p. 681–4.
OpenUrl CrossRef PubMed

[130] 130.
Scelo, G., et al., Genome-wide association study identifies multiple risk loci for renal cell carcinoma. Nat Commun, 2017. 8: p. 15724.
OpenUrl CrossRef

[131] 131.
Amundadottir, L., et al., Genome-wide association study identifies variants in the ABO locus associated with susceptibility to pancreatic cancer. Nat Genet, 2009. 41(9): p. 986–90.
OpenUrl CrossRef PubMed Web of Science

[132] 132.
Wu, C., et al., Genome-wide association study of survival in patients with pancreatic adenocarcinoma. Gut, 2014. 63(1): p. 152–60.
OpenUrl Abstract/FREE Full Text

[133] 133.
Wolpin, B.M., et al., Genome-wide association study identifies multiple susceptibility loci for pancreatic cancer. Nat Genet, 2014. 46(9): p. 994–1000.
OpenUrl CrossRef PubMed

[134] 134.
Zhang, M., et al., Three new pancreatic cancer susceptibility signals identified on chromosomes 1q32.1, 5p15.33 and 8q24.21. Oncotarget, 2016. 7(41): p. 66328–66343.
OpenUrl

[135] 135.
Klein, A.P., et al., Genome-wide meta-analysis identifies five new susceptibility loci for pancreatic cancer. Nat Commun, 2018. 9(1): p. 556.
OpenUrl

[136] 136.
Wrensch, M., et al., Variants in the CDKN2B and RTEL1 regions are associated with high-grade glioma susceptibility. Nat Genet, 2009. 41(8): p. 905–8.
OpenUrl CrossRef PubMed Web of Science

[137] 137.
Shete, S., et al., Genome-wide association study identifies five susceptibility loci for glioma. Nat Genet, 2009. 41(8): p. 899–904.
OpenUrl CrossRef PubMed Web of Science

[138] 138.
Sanson, M., et al., Chromosome 7p11.2 (EGFR) variation influences glioma risk. Hum Mol Genet, 2011. 20(14): p. 2897–904.
OpenUrl CrossRef PubMed Web of Science

[139] 139.
Rajaraman, P., et al., Genome-wide association study of glioma and meta-analysis. Hum Genet, 2012. 131(12): p. 1877–88.
OpenUrl CrossRef PubMed

[140] 140.
Walsh, K.M., et al., Variants near TERT and TERC influencing telomere length are associated with high-grade glioma risk. Nat Genet, 2014. 46(7): p. 731–5.
OpenUrl CrossRef PubMed

[141] 141.
Kinnersley, B., et al., Genome-wide association study identifies multiple susceptibility loci for glioma. Nat Commun, 2015. 6: p. 8559.
OpenUrl CrossRef PubMed

[142] 142.
Melin, B.S., et al., Genome-wide association study of glioma subtypes identifies specific differences in genetic susceptibility to glioblastoma and non-glioblastoma tumors. Nat Genet, 2017. 49(5): p. 789–794.
OpenUrl CrossRef PubMed

[143] 143.
Broderick, P., et al., Common variation at 3p22.1 and 7p15.3 influences multiple myeloma risk. Nat Genet, 2011. 44(1): p. 58–61.
OpenUrl CrossRef PubMed

[144] 144.
Weinhold, N., et al., The CCND1 c.870G>A polymorphism is a risk factor for t(11;14)(q13;q32) multiple myeloma. Nat Genet, 2013. 45(5): p. 522–525.
OpenUrl CrossRef PubMed

[145] 145.
Chubb, D., et al., Common variation at 3q26.2, 6p21.33, 17p11.2 and 22q13.1 influences multiple myeloma risk. Nat Genet, 2013. 45(10): p. 1221–1225.
OpenUrl CrossRef PubMed

[146] 146.
Swaminathan, B., et al., Variants in ELL2 influencing immunoglobulin levels associate with multiple myeloma. Nat Commun, 2015. 6: p. 7213.
OpenUrl CrossRef PubMed

[147] 147.
Mitchell, J.S., et al., Genome-wide association study identifies multiple susceptibility loci for multiple myeloma. Nat Commun, 2016. 7: p. 12050.
OpenUrl CrossRef

[148] 148.
Went, M., et al., Identification of multiple risk loci and regulatory mechanisms influencing susceptibility to multiple myeloma. Nat Commun, 2018. 9(1): p. 3707.
OpenUrl

[149] 149.
Xiao, Y., et al., SSBP2 variants are associated with survival in glioblastoma patients. Clin Cancer Res, 2012. 18(11): p. 3154–62.
OpenUrl Abstract/FREE Full Text

[150] 150.
Di Bernardo, M.C., et al., A genome-wide association study identifies six susceptibility loci for chronic lymphocytic leukemia. Nat Genet, 2008. 40(10): p. 1204–10.
OpenUrl CrossRef PubMed Web of Science

[151] 151.
Slager, S.L., et al., Genome-wide association study identifies a novel susceptibility locus at 6p21.3 among familial CLL. Blood, 2011. 117(6): p. 1911–6.
OpenUrl Abstract/FREE Full Text

[152] 152.
Slager, S.L., et al., Common variation at 6p21.31 (BAK1) influences the risk of chronic lymphocytic leukemia. Blood, 2012. 120(4): p. 843–6.
OpenUrl Abstract/FREE Full Text

[153] 153.
Berndt, S.I., et al., Genome-wide association study identifies multiple risk loci for chronic lymphocytic leukemia. Nat Genet, 2013. 45(8): p. 868–76.
OpenUrl CrossRef PubMed

[154] 154.
Speedy, H.E., et al., A genome-wide association study identifies multiple susceptibility loci for chronic lymphocytic leukemia. Nat Genet, 2014. 46(1): p. 56–60.
OpenUrl CrossRef PubMed

[155] 155.
Berndt, S.I., et al., Meta-analysis of genome-wide association studies discovers multiple loci for chronic lymphocytic leukemia. Nat Commun, 2016. 7: p. 10933.
OpenUrl CrossRef PubMed

[156] 156.
Law, P.J., et al., Genome-wide association analysis implicates dysregulation of immunity genes in chronic lymphocytic leukaemia. Nat Commun, 2017. 8: p. 14175.
OpenUrl

[157] 157.
Rapley, E.A., et al., A genome-wide association study of testicular germ cell tumor. Nat Genet, 2009. 41(7): p. 807–10.
OpenUrl CrossRef PubMed Web of Science

[158] 158.
Turnbull, C., et al., Variants near DMRT1, TERT and ATF7IP are associated with testicular germ cell cancer. Nat Genet, 2010. 42(7): p. 604–7.
OpenUrl CrossRef PubMed Web of Science

[159] 159.
Chung, C.C., et al., Meta-analysis identifies four new loci associated with testicular germ cell tumor. Nat Genet, 2013. 45(6): p. 680–5.
OpenUrl CrossRef PubMed

[160] 160.
Ruark, E., et al., Identification of nine new susceptibility loci for testicular cancer, including variants near DAZL and PRDM14. Nat Genet, 2013. 45(6): p. 686–9.
OpenUrl CrossRef PubMed

[161] 161.
Kristiansen, W., et al., Two new loci and gene sets related to sex determination and cancer progression are associated with susceptibility to testicular germ cell tumor. Hum Mol Genet, 2015. 24(14): p. 4138–46.
OpenUrl CrossRef PubMed

[162] 162.
Litchfield, K., et al., Identification of four new susceptibility loci for testicular germ cell tumour. Nat Commun, 2015. 6: p. 8690.
OpenUrl

[163] 163.
Litchfield, K., et al., Identification of 19 new risk loci and potential regulatory mechanisms influencing susceptibility to testicular germ cell tumor. Nat Genet, 2017. 49(7): p. 1133–1140.
OpenUrl CrossRef

[164] 164.
Wang, Z., et al., Meta-analysis of five genome-wide association studies identifies multiple new loci associated with testicular germ cell tumor. Nat Genet, 2017. 49(7): p. 1141–1147.
OpenUrl CrossRef

[165] 165.
Enciso-Mora, V., et al., A genome-wide association study of Hodgkin’s lymphoma identifies new susceptibility loci at 2p16.1 (REL), 8q24.21 and 10p14 (GATA3). Nat Genet, 2010. 42(12): p. 1126–1130.
OpenUrl CrossRef PubMed Web of Science

[166] 166.
Frampton, M., et al., Variation at 3p24.1 and 6q23.3 influences the risk of Hodgkin’s lymphoma. Nat Commun, 2013. 4: p. 2549.
OpenUrl PubMed

[167] 167.
Cozen, W., et al., A meta-analysis of Hodgkin lymphoma reveals 19p13.3 TCF3 as a novel susceptibility locus. Nat Commun, 2014. 5: p. 3856.
OpenUrl PubMed

[168] 168.
Sud, A., et al., Genome-wide association study of classical Hodgkin lymphoma identifies key regulators of disease susceptibility. Nat Commun, 2017. 8(1): p. 1892.
OpenUrl

[169] 169.
Sud, A., et al., Genome-wide association study implicates immune dysfunction in the development of Hodgkin lymphoma. Blood, 2018. 132(19): p. 2040–2052.
OpenUrl CrossRef

Cancer PRSweb – an Online Repository with Polygenic Risk Scores (PRS) for Major Cancer Traits and Their Phenome-wide Exploration in Two Independent Biobanks

Abstract

Introduction

Results

PRS Construction

PRS Evaluation

Comparison of Performance Metrics

Comparison across GWAS Sources

Comparison of Performance Across Methods

Comparison of Performance across Cancers

Comparison of Performance across Cohorts

Phenome-wide Association Analyses

Online Visual Catalog: Cancer PRSweb and R package Rprs

Discussion

Access to independent GWAS summary statistics

Mapping of Trait Definitions

Power Limitations

Transferability of PRS across cohorts

Conclusions

Methods

Evaluation cohorts

MGI cohort

UK Biobank cohort (UKB)

Genotyping, sample quality control and imputation

MGI

UK Biobank

Phenome generation

MGI

UK Biobank

PRS Structure

Sources of GWAS summary statistics

GWAS Catalog

Large GWAS meta-analyses

UK-Biobank-based GWAS

PRS Construction

PRS Evaluation

PRS Association Testing

Phenome-wide Exploration of PRS Associations

Online Visual Catalog: PRSweb

Data availability

Author contributions

Competing interests

Materials & Correspondence

Acknowledgements

Footnotes

Reference

Citation Manager Formats

Subject Area