bioRxiv
New Results

Genetic Architecture of Complex Traits and Disease Risk Predictors

Soke Yuen Yong, Timothy G. Raben, Louis Lello, Stephen D.H. Hsu
doi: https://doi.org/10.1101/2020.02.12.946608
Soke Yuen Yong
1Department of Physics and Astronomy, Michigan State University
  • For correspondence: yongsoke@msu.edu
Timothy G. Raben
1Department of Physics and Astronomy, Michigan State University
Louis Lello
1Department of Physics and Astronomy, Michigan State University
2Genomic Prediction, North Brunswick, NJ
Stephen D.H. Hsu
1Department of Physics and Astronomy, Michigan State University
2Genomic Prediction, North Brunswick, NJ

Abstract

Genomic prediction of complex human traits (e.g., height, cognitive ability, bone density) and disease risks (e.g., breast cancer, diabetes, heart disease, atrial fibrillation) has advanced considerably in recent years. Predictors have been constructed using penalized algorithms that favor sparsity: i.e., which use as few genetic variants as possible. We analyze the specific genetic variants (SNPs) utilized in these predictors, which can vary from dozens to as many as thirty thousand. We find that the fraction of SNPs in or near genic regions varies widely by phenotype. For the majority of disease conditions studied, a large amount of the variance is accounted for by SNPs outside of coding regions. The state of these SNPs cannot be determined from exome-sequencing data. This suggests that exome data alone will miss much of the heritability for these traits – i.e., existing PRS cannot be computed from exome data alone. We also study the fraction of SNPs and of variance that is in common between pairs of predictors. The DNA regions used in disease risk predictors so far constructed seem to be largely disjoint (with a few interesting exceptions), suggesting that individual genetic disease risks are largely uncorrelated. It seems possible in theory for an individual to be a low-risk outlier in all conditions simultaneously.

Introduction

Genomic prediction of complex traits and disease risks has advanced considerably thanks to the recent advent of large data sets and improved algorithms. These algorithms range from simple regression, applied to one SNP at a time to estimate statistical significance and effect size (e.g., as used in GWAS), to high dimensional optimization methods such as compressed sensing or sparse learning [1–4]. They produce Polygenic Risk Scores (PRS) or Polygenic Scores (PGS): functions that map the state of an individual’s DNA at specific locations (SNPs), to a risk score or predicted quantitative trait value.

Predictors (PGS or PRS) now exist for a number of important traits and risks, many of which have undergone out-of-sample testing (i.e., validation in groups of individuals not used in training and from other data sets or from separate ancestries) [5–7]. The genetic architectures uncovered vary significantly: the number of SNPs required to capture most of the predictor variance ranges from a few dozen to many thousands. In contrast, traditional genome-wide association studies (GWAS) can implicate the entire genome [8, 9], making them unwieldy to analyze.

In the case of disease risk, the predictors are already good enough to identify risk outliers, that is, individuals with unusually high (or low) risk of a specific condition. There are many clinical applications for such predictors [5, 10–19] (although there is still much work to be done to overcome sampling and algorithmic biases and disparity [20, 21]). Below we mention two examples.

Breast Cancer

Certain variants in the BRCA1 and BRCA2 genes are known to elevate breast cancer risk significantly [22, 23]. However, these mutations affect no more than a few women per thousand in the general population [24–26]. By contrast, PRS using thousands of common SNP variants can now identify on the order of ten times as many women who are in the high-risk category [5, 7, 10, 27, 28]. The standard of care for high-risk women typically includes additional screening, such as mammograms beginning a decade earlier than for normal-risk women. Early detection can also lead to significant cost savings [29]. What can we say about the thousands of common SNPs used in the PRS? Do they overlap with SNPs used in PRS for other conditions (e.g., other cancers)?

Height

Idiopathic Short Stature (ISS) refers to extreme short stature that does not have a diagnostic explanation (e.g., height below 5 foot 2 inches in adult males)[30]. Growth hormone treatment is sometimes prescribed for children who are at risk for this condition, at a cost in the $100k range. Typically, these would be children in the bottom percentiles for height within their age group [31, 32]. However, it is difficult for pediatric endocrinologists, whose responsibility it is to prescribe HGH for these children, to know whether the child is simply passing through a temporary phase of slow growth (and will, by adulthood, reach normal height)[33, 34]. Adult height prediction from DNA (with 95 percent confidence interval roughly ±2 inches) will allow physicians to avoid expensive HGH treatment (with significant potential side-effects) for children who are merely short for their age (late-developing) and are likely to be in the normal range in adulthood.

For the first time, we can begin to address some general questions concerning the genetic architectures of complex traits. In this paper we address the following questions:

  1. What is the (qualitative) genetic architecture of specific disease risks? How many SNPs, where are they, how many genes?

  2. How much of the total risk is controlled by loci in coding vs non-coding regions?

  3. Is exome sequencing data sufficient for computation of Polygenic Risk Scores (PRS)?

  4. How much genetic overlap exists between different disease architectures? With millions of SNPs in the genome it is entirely possible that different diseases have nearly disjoint genetic architectures – i.e., risk is mostly controlled by distinct regions of DNA. On the other hand, we might uncover overlap regions of DNA which affect multiple disease risks simultaneously.

In this paper we consider predictors for a selection of disease conditions/traits: asthma, atrial fibrillation, basal cell carcinoma, breast cancer, coronary artery disease, type-1 diabetes, type-2 diabetes, diastolic blood pressure, educational years, gallstones, glaucoma, gout, heart attack, height, high cholesterol, hypertension, hypothyroidism, malignant melanoma, menopause, pulse rate, and systolic blood pressure. All predictors, except the coronary artery disease predictor, were built by training on case-control phenotype data from the UK Biobank [35, 36] that relied on custom array genotyping (see Appendix C for details). This array was designed to have detailed coverage in areas known to be associated with certain phenotypes, and to contain a wide sampling of the entire human genome. More details of the array design can be found on the UK Biobank website https://www.ukbiobank.ac.uk/scientists-3/uk-biobank-axiom-array/ and in Appendix C. These predictors were derived using the L1 penalized regression (sparse learning) methods found in [5, 6]. Recent algorithmic benchmarking for complex trait prediction in plants and animals has shown that linear methods work as well as, if not better than, non-linear, Bayesian, or deep learning approaches for current data set sizes [37]. Predictors for diastolic blood pressure, systolic blood pressure, and pulse rate are reported for the first time here, but were designed as described in [5]. The coronary artery predictor originated from [7], and the associated minor allele frequencies were obtained using Ensembl’s minor allele frequency calculator [38].

Of course, not all of the genetic variants affecting disease risk have been discovered. The PRS continue to improve as more training data become available. However, the SNPs used in the existing PRS tend to be those that account for the most risk variance. Equivalently, the statistical evidence supporting their association with the disease risk is highest. Relevant SNPs that have yet to be discovered are either common SNPs with very small effect size, or very rare SNPs that are not probed using existing gene arrays.

Variance and Effect Sizes

The majority of this work deals with characterizing the relative sizes of the effect of particular SNPs on the performance of a polygenic predictor. A predictor in this case is a set of weights, $\{\beta_i\}$, for a set of SNPs, $S$. An individual’s phenotype, $y$, can be described by

$$y = \sum_{i \in S} \beta_i x_i + \epsilon = \hat{y} + \epsilon, \tag{1}$$

where $x_i \in \{0, 1, 2\}$ is the number of minor alleles for SNP $i$, $\hat{y}$ is the predicted value of the phenotype, and $\epsilon$ is an error term. Our primary object of interest is the variance of this prediction. The contribution of a single SNP to this variance is expressed in terms of the $\beta_i$ and the minor allele frequency (MAF), $f_i$, as:

$$\mathrm{var}(\beta_i x_i) = 2 f_i (1 - f_i)\, \beta_i^2. \tag{2}$$

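In the limit of small MAF, this becomes $2 f_i \beta_i^2$. The overall variance of the individual’s predicted phenotype can thus be described as:

$$\mathrm{var}(\hat{y}) = \sum_{i, j \in S} \beta_i \beta_j\, \mathrm{cov}(x_i, x_j) \approx \sum_{i \in S} 2 f_i (1 - f_i)\, \beta_i^2, \tag{3}$$

where the final approximation holds when the SNPs are largely uncorrelated. This is true for most minor allele frequencies, and has been checked empirically in particular instances (see for example [6]). For the predictors in this work, there are only a handful of SNPs with correlation of about 0.01 lying within 2000 kilo base pairs of each other. In this sense, the variance due to each SNP can be considered as a linear effect. We can then calculate the variance accounted for by a subset, $\mathcal{J} \subset S$, of the predictor SNPs, as a percentage of the total variance of the individual’s predicted phenotype:

$$\frac{\mathrm{var}(\hat{y}_{\mathcal{J}})}{\mathrm{var}(\hat{y})} = \frac{\sum_{i, j \in \mathcal{J}} \beta_i \beta_j\, \mathrm{cov}(x_i, x_j)}{\sum_{i, j \in S} \beta_i \beta_j\, \mathrm{cov}(x_i, x_j)} \approx \frac{\sum_{i \in \mathcal{J}} f_i (1 - f_i)\, \beta_i^2}{\sum_{i \in S} f_i (1 - f_i)\, \beta_i^2}, \tag{4}$$

where $\hat{y}_{\mathcal{J}} = \sum_{i \in \mathcal{J}} \beta_i x_i$ and, again, the final approximation holds when the SNPs are uncorrelated.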
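
The uncorrelated-SNP approximation described above can be sketched in a few lines of Python. This is only an illustration: it assumes Hardy-Weinberg proportions (so that the minor-allele count of a SNP with MAF f has variance 2f(1 - f)), and the function names and toy numbers are hypothetical, not taken from the predictors in this work.

```python
def snp_variance(beta, maf):
    """Variance contributed by one SNP, 2 f (1 - f) beta^2.

    Assumes Hardy-Weinberg proportions for the minor-allele count.
    """
    return 2.0 * maf * (1.0 - maf) * beta ** 2


def variance_fraction(betas, mafs, subset):
    """Percentage of predictor variance due to the SNP indices in `subset`,
    treating all SNPs as uncorrelated (sums of single-SNP variances)."""
    per_snp = [snp_variance(b, f) for b, f in zip(betas, mafs)]
    total = sum(per_snp)
    return 100.0 * sum(per_snp[i] for i in subset) / total


# Toy three-SNP predictor (illustrative numbers only).
betas = [0.5, -0.2, 0.1]
mafs = [0.5, 0.25, 0.1]
pct_first_snp = variance_fraction(betas, mafs, subset=[0])
print(f"SNP 0 accounts for {pct_first_snp:.1f}% of the predictor variance")
```

In this toy case a single common SNP with a large weight dominates the variance, which is the kind of pattern the per-gene analysis later in the paper looks for.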

Results and Analysis

Predictor SNPs in genic regions

We are interested in investigating how predictor SNPs located in genic regions impact our predictors.

The most obvious way to identify predictor SNPs located inside genic regions is to define a SNP as being within a genic region if its genomic coordinates fall between the start-point and end-point coordinates of any protein-coding gene. The GENCODE Release 19 annotation of the human genome [39] (based on the GRCh37.p13 reference human genome assembly [40]) was chosen as the source of our reference set of gene boundary coordinates. Currently, it is still unclear where exactly genic regions end and intergenic regions begin [41–43], and so there is a possibility that this choice of gene boundary coordinates may not be definitive for our purposes. We asked these questions: as these reference gene boundary coordinates are varied, by how much does the separation of predictor SNPs into genic and non-genic categories vary, and how significantly does the influence of the (increasingly large) genic section of the predictor SNPs change?
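
A minimal sketch of this classification, assuming genes are given as plain (chromosome, start, end) tuples and SNPs as (chromosome, position) tuples. The coordinates below are made up for illustration and are not GENCODE entries; `is_genic` and `genic_percentage` are hypothetical helper names.

```python
def is_genic(chrom, pos, genes, k_kb=0):
    """True if the SNP lies inside any gene once every gene boundary
    has been widened by k_kb kilo base pairs at both ends."""
    pad = int(k_kb * 1000)
    return any(
        g_chrom == chrom and start - pad <= pos <= end + pad
        for g_chrom, start, end in genes
    )


def genic_percentage(snps, genes, k_kb=0):
    """Percentage of predictor SNPs classified as genic at expansion k_kb."""
    n_genic = sum(is_genic(c, p, genes, k_kb) for c, p in snps)
    return 100.0 * n_genic / len(snps)


# Toy gene annotation and predictor SNPs (hypothetical coordinates).
genes = [("1", 10_000, 20_000), ("2", 50_000, 60_000)]
snps = [("1", 15_000), ("1", 25_000), ("2", 45_000), ("2", 100_000)]

for k in (0, 5, 30):
    print(k, genic_percentage(snps, genes, k_kb=k))
```

A linear scan over all genes is enough for a sketch; with a full annotation and tens of thousands of SNPs, a sorted index or interval tree would be the more practical choice.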

Figure 1 shows for a selection of disease conditions how the number of the predictor SNPs categorized as located in genic regions - expressed as a percentage of the total number of predictor SNPs for that disease condition - changes as all gene boundaries (according to GENCODE Release 19) are expanded by an increasing number k of kilo base pairs at both ends. At the reference gene boundaries (k = 0), the percentage of predictor SNPs which are genic ranges from about 50% (many disease conditions) to about 60% (gallstones, malignant melanoma, atrial fibrillation); while at k = 30, this percentage rises to between 60% (breast cancer, type-1 diabetes, education years) and 75% (gallstones, malignant melanoma). This description excludes the coronary artery disease predictor (42.5% to 55%), which has a distinctly low proportion of genic predictor SNPs. For all disease conditions, the increase in the genic percentage of predictor SNPs with k occurs at roughly the same rate, which appears to be almost linear. This suggests that the predictor SNPs for every disease condition are approximately uniformly distributed by distance outside the reference gene boundary coordinates.

Figure 1: Plots of the number of predictor SNPs located within genic regions, expressed as a percentage of the total number of predictor SNPs for that disease condition, against expansion of GENCODE Release 19 gene boundaries by k kilo base pairs.

Figure 2 shows for each disease condition how the variance accounted for by the predictor SNPs located in genic regions - expressed as a percentage of the total variance accounted for by all predictor SNPs for that condition - changes as the boundaries of every gene (according to GENCODE Release 19) are expanded by k kilo base pairs at both ends. At the reference gene boundaries (k = 0), the percentage of predictor variance accounted for by SNPs in genic regions ranges from about 40% (breast cancer, type-1 diabetes) to 90% (gallstones, gout, malignant melanoma), with notable outliers at 25% (atrial fibrillation) and 20% (coronary artery disease). For the majority of disease conditions, the percentage of variance accounted for by the genic predictor SNPs remains approximately flat as k is increased, meaning that practically all the variance accounted for by genic predictor SNPs is due to SNPs contained within the reference gene boundaries. Noticeable exceptions occur for glaucoma at k = 6.5, breast cancer at k = 17.5, and menopause at k = 27.5, where the observed large jumps in variance indicate the presence of some SNP(s) at that genomic location with a significant effect on that specific disease condition.

Figure 2: Plots of the variance accounted for by predictor SNPs located within genic regions, expressed as a percentage of the total variance accounted for by all predictor SNPs for that disease condition, against expansion of GENCODE Release 19 gene boundaries by k kilo base pairs.

Now that the extent of the variance accounted for by SNPs located within genic regions has been established for each disease condition, it is natural to investigate next how this genic variance is distributed between individual (protein-coding) genes. When considering a specific disease condition, for each gene, the variance accounted for by all predictor SNPs located within the (effective) gene boundary coordinates is summed and expressed as a percentage of the total variance accounted for by all predictor SNPs for that condition. For the purposes of this calculation, the boundaries of the genic regions were chosen to be at k = 30.
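
The per-gene summation can be sketched as follows, assuming each predictor SNP already carries its single-SNP variance and each gene is a (name, chromosome, start, end) tuple. Gene names, coordinates, and variances here are placeholders in arbitrary units. Because the padded boundaries of neighbouring genes can overlap, one SNP may count toward more than one gene.

```python
from collections import defaultdict


def variance_by_gene(snps, genes, k_kb=30):
    """Percent of total predictor variance attributed to each gene.

    snps:  list of (chrom, pos, variance)
    genes: list of (name, chrom, start, end)
    A SNP contributes to every gene whose padded interval contains it.
    """
    pad = k_kb * 1000
    total = sum(var for _, _, var in snps)
    per_gene = defaultdict(float)
    for chrom, pos, var in snps:
        for name, g_chrom, start, end in genes:
            if g_chrom == chrom and start - pad <= pos <= end + pad:
                per_gene[name] += var
    return {name: 100.0 * v / total for name, v in per_gene.items()}


def top_genes(per_gene, n=15):
    """The n genes with the largest share of predictor variance."""
    return sorted(per_gene.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Two hypothetical neighbouring genes whose padded boundaries overlap.
genes = [("GENE_A", "1", 100_000, 120_000), ("GENE_B", "1", 140_000, 160_000)]
snps = [("1", 130_000, 6.0), ("1", 80_000, 3.0), ("1", 500_000, 1.0)]

ranking = top_genes(variance_by_gene(snps, genes))
print(ranking)
```

Note that the per-gene percentages need not sum to the total genic percentage, since the SNP at position 130,000 is counted under both genes.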

Figures 8 through 28 in Appendix A show the percentage of predictor variance accounted for by single genes for asthma, atrial fibrillation, basal cell carcinoma, breast cancer, coronary artery disease, type-1 diabetes, type-2 diabetes, diastolic blood pressure, educational years, gallstones, glaucoma, gout, heart attack, height, high cholesterol, hypertension, hypothyroidism, malignant melanoma, menopause, pulse rate, and systolic blood pressure. Only the fifteen largest values of variance accounted for by a single gene are displayed for each condition. As each genic predictor SNP may lie within the boundaries of more than one gene due to the expanded gene boundaries overlapping, multiple genes may share the exact same set of predictor SNPs and therefore the same value of total variance accounted for by single genes.

Figure 3: The percentage of predictor SNPs which are found both in genic regions and the UK Biobank exome data, for each disease condition. The disease conditions are listed from left to right on the horizontal axis in order of decreasing percentage. Each vertical bar is colored red with a depth of shade proportional to the height of the bar. Here, “genic” SNPs are contained within the GENCODE Release 19 gene boundaries plus 30 kilo base pairs at both ends.

Figure 4: The variance accounted for by predictor SNPs which are both in genic regions and detected by the UK Biobank exome data, as a percentage of the total variance accounted for by all predictor SNPs, for each disease condition. The disease conditions are listed from left to right on the horizontal axis in order of decreasing percentage. Each vertical bar is colored blue with a depth of shade proportional to the height of the bar. Here, “genic” SNPs are contained within the GENCODE Release 19 gene boundaries plus 30 kilo base pairs at both ends.

Figure 5: The breakdown of the percentage of predictor SNPs according to whether their location is genic and whether the exome data serves to probe them, for each predictor. The bar sections representing predictor SNPs in genic regions and in the exome data are labelled ‘Genic/Exonic’ and colored blue, those representing predictor SNPs not in genic regions but present in the exome data are ‘Non-genic/Exonic’ and colored yellow, those representing predictor SNPs which are located in genic regions but not found in the exome data are ‘Genic/Non-exonic’ and colored green, and those representing predictor SNPs neither in genic regions nor in the exome data are ‘Non-genic/Non-exonic’ and colored red. As expected, the yellow ‘Non-genic/Exonic’ bar sections are too small to be discernible. Here, “genic” SNPs are contained within the GENCODE Release 19 gene boundaries plus 30 kilo base pairs at both ends.

Figure 6: Table containing the pairwise overlap between predictors in terms of the number of predictor SNPs in common, expressed as a percentage of the total number of SNPs in the predictor for the condition labelling each row. Pairs of SNPs less than 4000 base pairs apart were considered to be identified with one another. For a particular row, the number in each column represents the percentage of SNPs from the row predictor that can be identified with the SNPs from the column predictor.

Figure 7: As described by Eq. (5), this table contains the pairwise overlap between predictors in terms of the variance accounted for by predictor SNPs in common, expressed as a percentage of the total variance accounted for by the SNPs belonging to the predictor for the condition labelling each row. The overlapping variance is weighted according to the sign of the correlation between each pair of associated SNPs less than 4000 base pairs apart. This can cause diagonal elements to be less than one hundred percent, as anti-correlated SNPs may be included in the overlap calculation.

It is evident that certain disease conditions have one (or perhaps two or three) dominating value(s) of variance accounted for by a single gene. The most striking results are found with basal cell carcinoma (one gene, IRF4, supplying 28% of the predictor variance out of the 87% total genic variance at k = 30 and a second, TGM3, supplying 12% of the predictor variance), breast cancer (two genes, FGFR2 and TOX3, each supplying 14%-15% of the predictor variance out of 66% total genic variance), type-2 diabetes (one gene, TCF7L2, supplying 25% of the predictor variance out of 81% total genic variance), gallstones (three genes, ABCG8, ABCG5 and DYNC2LI1, each supplying 80%-83% of the predictor variance out of 98% total genic variance), glaucoma (two genes, ALDH9A1 and TMCO1, each supplying 19% of the predictor variance out of 78% total genic variance), gout (one gene, ABCG2, supplying 57% of the predictor variance out of 96% total genic variance and another, SLC2A9, supplying 12% of the predictor variance), heart attack (one gene, LPA, supplying 21% of the predictor variance out of 78% total genic variance), malignant melanoma (five genes, AC092143.1, TUBB3, TCF25, MC1R and DEF8, each supplying 37%-43% of the predictor variance out of 92% total genic variance), and menopause (one gene, UTY, supplying 24% of the predictor variance out of 94% total genic variance).

The more novel among these results are the strong links observed between glaucoma and ALDH9A1, between gallstones and DYNC2LI1, between menopause and UTY, and between malignant melanoma and the four genes AC092143.1, TUBB3, TCF25, and DEF8. These relationships either have not been well established until now or have not previously been suggested.

The rest of our findings confirm previous work by other groups. The association of IRF4 [44] and TGM3 with basal cell carcinoma is well-known, as is the relationship of FGFR2 [46] and TOX3 [47] to breast cancer, the link between TCF7L2 [48] and type-2 diabetes, the association of ABCG8 [49] and ABCG5 [50] with gallstones, the role of TMCO1 [51] in glaucoma, the relationship of ABCG2 [52, 53] and SLC2A9 [54] to gout, the connection between LPA [55] and heart attack / coronary artery disease, and the role of MC1R [56, 57] in malignant melanoma.

For every disease condition considered here, the full list of genes responsible for the top fifteen values of variance displayed here can be found in Appendix B.

Overlap between predictor SNPs and whole-exome sequencing data

Modern whole-exome sequencing techniques are expected to be able to access about 85% of known disease-related variants [58, 59]. Assuming this is correct, we would expect about the same fraction of genic SNPs belonging to our predictors for disease conditions to be identifiable via exome-sequencing data. To verify this, we compared our sets of predictor SNPs against the whole-exome sequencing data released by the UK Biobank in March 2019 [60] (to be specific, the version of the whole-exome sequencing data set generated using a Functionally Equivalent (FE) pipeline [61]). In this section, once again, genic SNPs are defined as those SNPs located within the GENCODE Release 19 protein-coding gene boundaries extended by 30 kilo base pairs at both ends (or k = 30).

Figure 3 shows for each disease condition the percentage of predictor SNPs located in genic regions which are also found in the UK Biobank exome data, where the disease conditions are listed from left to right on the horizontal axis in order of decreasing percentage. For about two-thirds of the conditions surveyed, 10% or so of the genic SNPs for each predictor also formed part of the set of SNPs identified via exome-sequencing. The remaining disease conditions have up to about 17% (in terms of numbers) of their genic predictor SNPs detected by the exome-sequencing data set - with one exception, coronary artery disease, which displays an extraordinarily low (2%) value of this overlap.

Figure 4 shows for each disease condition the variance accounted for by the genic predictor SNPs from Figure 3, expressed as a percentage of the total variance accounted for by all the SNPs in the predictor, where the disease conditions are listed from left to right on the horizontal axis in order of decreasing percentage. For the majority of conditions, 20% or less of the total variance accounted for by the predictor SNPs comes from genic SNPs which show up in the UK Biobank exome data. In fact, less than 5% of the variance accounted for by the breast cancer, atrial fibrillation, and coronary artery disease predictor SNPs is detected by the exome data. Exceptions to this are the predictors for gallstones (80% of predictor variance detected by the exome data), gout (70% of predictor variance detected by the exome data), malignant melanoma (45% of predictor variance identified by the exome data), and menopause (30% of predictor variance identified by the exome data).

It may be worth noting that the ordering of the predictors by percentage of predictor SNPs which are genic (in Figure 1) and the ordering according to percentage of predictor SNPs which are both genic and accessible via the exome data (in Figure 3) are about the same. A similar observation can be made regarding Figures 2 and 4 (here in terms of variance accounted for).

In Figure 5, an overall view of the extent to which predictor SNPs in genic and non-genic regions are identifiable based on exome-sequencing data is given. Figure 5 shows for each disease condition the breakdown of the percentage of predictor SNPs according to whether their location is genic and whether the exome data serves to probe them. This breakdown does not vary much between predictors: in general about 10%-17% of predictor SNPs are both in genic regions and found in the exome data (‘Genic/Exonic’, colored blue), 0% (as might be expected) are not in genic regions but are present in the exome data (‘Non-genic/Exonic’, colored yellow), 55%-65% are located in genic regions but not found in the exome data (‘Genic/Non-exonic’, colored green), and 20%-35% are neither in genic regions nor in the exome data (‘Non-genic/Non-exonic’, colored red). The only deviation comes from the coronary artery disease predictor SNPs - less than 5% are ‘Genic/Exonic’, while nearly 45% are ‘Non-genic/Non-exonic’.
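
The four-way breakdown amounts to crossing two set-membership tests. A minimal sketch, assuming SNPs are identified by string IDs and that the sets of genic SNPs and of SNPs present in the exome data have already been computed; the IDs are placeholders and `breakdown` is a hypothetical helper name.

```python
from collections import Counter


def breakdown(snp_ids, genic_ids, exome_ids):
    """Percentage of predictor SNPs in each genic/exonic category."""
    counts = Counter()
    for snp in snp_ids:
        genic = "Genic" if snp in genic_ids else "Non-genic"
        exonic = "Exonic" if snp in exome_ids else "Non-exonic"
        counts[f"{genic}/{exonic}"] += 1
    return {label: 100.0 * n / len(snp_ids) for label, n in counts.items()}


# Ten placeholder SNPs: seven genic, one of which is also in the exome data.
snp_ids = [f"snp{i}" for i in range(1, 11)]
genic_ids = {f"snp{i}" for i in range(1, 8)}
exome_ids = {"snp1"}

shares = breakdown(snp_ids, genic_ids, exome_ids)
print(shares)
```

Categories with no SNPs (here ‘Non-genic/Exonic’) are simply absent from the returned dictionary.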

Pairwise comparison of predictors

We now focus on finding connections between disease conditions based on similarities in their predictors. In this section, the analysis involving the coronary artery disease predictor was restricted to the top twenty thousand predictor SNPs as ranked by value of variance accounted for.

Figure 6 shows the pairwise overlap between disease conditions in terms of the number of predictor SNPs that each pair of conditions has in common. Here, pairs of SNPs less than 4000 base pairs apart were considered to be essentially the same SNP (where this separation was chosen so that most or all SNP pairs with high linkage disequilibrium levels are expected to be identified with one another [62]). Each row of the table corresponds to a particular disease condition, and reading the row from left to right (going from one column to the next) gives the number of predictor SNPs (expressed as a percentage of the total number of SNPs in the row-label predictor) that the conditions labelling each column share with the condition that the row corresponds to.
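
One way to compute such a table entry is sketched below, assuming each predictor is a list of (chromosome, position) tuples. The helper names and coordinates are hypothetical; the 4000 base pair window is the one quoted in the text. The result is not symmetric, because each direction is normalized by the size of the row predictor.

```python
import bisect
from collections import defaultdict


def index_by_chrom(snps):
    """Map chromosome -> sorted list of SNP positions."""
    index = defaultdict(list)
    for chrom, pos in snps:
        index[chrom].append(pos)
    for positions in index.values():
        positions.sort()
    return index


def has_neighbor(chrom, pos, index, window=4000):
    """True if the index holds a SNP on `chrom` less than `window` bp from `pos`."""
    positions = index.get(chrom, [])
    # First indexed position strictly greater than pos - window.
    lo = bisect.bisect_right(positions, pos - window)
    return lo < len(positions) and positions[lo] < pos + window


def overlap_percent(row_snps, col_snps, window=4000):
    """Percentage of row-predictor SNPs identified with a column-predictor SNP."""
    index = index_by_chrom(col_snps)
    shared = sum(has_neighbor(c, p, index, window) for c, p in row_snps)
    return 100.0 * shared / len(row_snps)


# Two toy predictors (hypothetical coordinates).
pred_a = [("1", 1_000), ("1", 50_000), ("2", 7_000)]
pred_b = [("1", 3_000), ("2", 12_000), ("2", 6_500), ("3", 100)]

print(overlap_percent(pred_a, pred_b), overlap_percent(pred_b, pred_a))
```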

A pair of conditions may be considered to have a significant connection if the percentage of SNPs in common is substantial when read off the table both ways. Analyzing the table in this manner produces two notable groupings, with all the conditions in each group having large pairwise overlap.

  1. Asthma – diastolic blood pressure – hypertension – systolic blood pressure – education years – height: The first four disease conditions form a combination which is not too surprising, but the same cannot be said about the last two traits. The table shows that all possible pairs taken from these six conditions overlap by about 10% or so, with the following exceptions which exceed this level by some way: 38% of the systolic blood pressure SNPs also belong to the diastolic blood pressure predictor, while 35% of the diastolic blood pressure SNPs also belong to the systolic blood pressure predictor; 17% of the diastolic blood pressure SNPs belong to the hypertension predictor, while 18% of the hypertension SNPs belong to the diastolic blood pressure predictor; and 18% of the hypertension SNPs belong to the systolic blood pressure predictor, while 19% of the systolic blood pressure SNPs belong to the hypertension predictor.

  2. Basal cell carcinoma – malignant melanoma: 7.8% of the basal cell carcinoma SNPs belong to the malignant melanoma predictor, while 8.1% of the malignant melanoma SNPs belong to the basal cell carcinoma predictor.

Next, variance accounted for is used to measure the overlap, as seen in Figure 7. This results in the discovery of more relations between different predictors. The method used on each pair of predictors was as follows: For each SNP, $i$, in the first predictor $A$ (corresponding to the condition labelling the row), all SNPs, $j$, from the second predictor $B$ (corresponding to the condition labelling the column) located less than 4000 base pairs away were identified. Every such associated SNP from the second predictor was assigned a weight, $w_{ij}$, of uniform magnitude, with the sign of the weight based on the sign of the SNP’s effect size relative to the sign of the effect size of the SNP from the first predictor - positive when the signs were the same, and negative when the signs were different. These weights were then multiplied by the variance due to the SNP from the first predictor and summed. If we label the set of second-predictor SNPs within 4000 base pairs of SNP $i$ as $\mathcal{C}_i$, this correlation estimate, $\rho_{AB}$, can be expressed as

$$\rho_{AB} = \frac{1}{V_A} \sum_{i \in A} \mathrm{var}(\beta_i x_i) \sum_{j \in \mathcal{C}_i} w_{ij}, \tag{5}$$

where the normalization is chosen to be the total predictor variance of the row predictor, $V_A = \sum_{i \in A} \mathrm{var}(\beta_i x_i)$.

This produced a weighted overlap in terms of variance, that accounts for whether the associated SNP pairs are positively correlated (effect sizes have equal signs) or negatively correlated (effect sizes have differing signs). Figure 7 displays this overlap as a percentage, and is basically Figure 6 expressed in terms of variance (weighted according to correlation sign) accounted for by the overlapping predictor SNPs.
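
The sign-weighted overlap can be sketched as below. The text specifies only that the weights have uniform magnitude; this sketch assumes, as an illustration, that each partner SNP in the window receives weight ±1/n, where n is the number of partners of the row SNP, so that a fully concordant neighbourhood contributes exactly the row SNP's variance. Predictors are lists of (chromosome, position, beta, MAF) tuples with made-up values, and the function names are hypothetical.

```python
def sign(x):
    """Return -1, 0, or +1 according to the sign of x."""
    return (x > 0) - (x < 0)


def snp_variance(beta, maf):
    """Single-SNP variance 2 f (1 - f) beta^2 (Hardy-Weinberg assumed)."""
    return 2.0 * maf * (1.0 - maf) * beta ** 2


def weighted_variance_overlap(row, col, window=4000):
    """Sign-weighted variance overlap, as a percentage of the row predictor's
    total variance. ASSUMPTION: each partner SNP gets weight +-1/n."""
    total = sum(snp_variance(b, f) for _, _, b, f in row)
    shared = 0.0
    for chrom, pos, beta, maf in row:
        partners = [b2 for c2, p2, b2, _ in col
                    if c2 == chrom and abs(p2 - pos) < window]
        if partners:
            weight = sum(sign(beta * b2) for b2 in partners) / len(partners)
            shared += weight * snp_variance(beta, maf)
    return 100.0 * shared / total


# Toy predictors (hypothetical values).
row = [("1", 1_000, 0.2, 0.5), ("1", 90_000, 0.1, 0.5)]
col = [("1", 1_500, 0.3, 0.2)]
print(weighted_variance_overlap(row, col))

# Self-overlap can fall below 100% when nearby SNPs have opposite signs.
mixed = [("1", 1_000, 0.2, 0.5), ("1", 2_000, -0.2, 0.5)]
print(weighted_variance_overlap(mixed, mixed))
```

Under this weighting, the second example shows how a predictor compared with itself can score below 100%: two nearby SNPs with opposite effect signs cancel in the sum.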

Once again, significant connections are observed in the case of diastolic blood pressure – hypertension – pulse rate – systolic blood pressure and basal cell carcinoma – malignant melanoma. In the case of the first group of conditions, the overlap of weighted variance ranges from 8% (systolic blood pressure – pulse rate) to 51% (systolic blood pressure – diastolic blood pressure). In the second case, 10% of the basal cell carcinoma predictor variance is shared with the malignant melanoma predictor, while 58% of the malignant melanoma predictor variance belongs to the basal cell carcinoma predictor.

We now also have groups of conditions which appear to be strongly positively correlated in terms of variance overlap, but were unremarkable in terms of number of SNPs in common. The most obvious is coronary artery disease – heart attack – high cholesterol – hypertension, where the magnitude of the pairwise overlap in terms of weighted variance ranges from 4% (hypertension – coronary artery disease) to 35% (high cholesterol – heart attack). Other single pairs of conditions with large overlapping variances are gout – high cholesterol (where 62% of the gout predictor variance also belongs to the high cholesterol predictor, while 35% of the high cholesterol predictor variance also belongs to the gout predictor), type-1 diabetes – type-2 diabetes (where 23% of the type-1 diabetes predictor variance also belongs to the type-2 diabetes predictor, while 30% of the type-2 diabetes predictor variance also belongs to the type-1 diabetes predictor), coronary artery disease – type-2 diabetes (4% and 25%), and gallstones – high cholesterol (85% and 4%). Also worth mentioning is the overlap of height with each of education years, asthma, diastolic blood pressure, hypertension, pulse rate, and systolic blood pressure, which is about 10% in every case.

Conclusions

This paper explores the genetic architectures of a number of common disease conditions and complex traits, as revealed by the most important SNPs used in genomic predictors.

The results are complex – primarily summarized in the many figures in the paper and Appendix. However, we can make some general statements:

  1. The fraction of SNPs in or near genic regions varies widely by phenotype. For example, in the case of Coronary Artery Disease and Atrial Fibrillation, less than 20-30 percent of the total risk variance is due to SNPs near genic regions.

  2. For the majority of disease conditions studied, most of the variance is accounted for by SNPs whose state cannot be determined from exome-sequencing data. This suggests that exome data alone will miss much of the heritability for these traits. Stated somewhat differently: exome sequencing data for a specific individual misses much of the information necessary to compute their PRS.

  3. The DNA regions used in disease risk predictors so far constructed seem to be largely disjoint (with a few interesting exceptions), suggesting that individual genetic disease risks are largely uncorrelated.

Observation 3 has interesting implications for pleiotropy [63–65]. We found that genetic risks for different conditions are largely uncorrelated. This suggests that there can exist individuals with, e.g., low risk simultaneously in each of multiple conditions, for essentially any combination of conditions. There is no trade-off required between different disease risks (at least, not among the ones studied here). One could speculate that a lucky individual with exceptionally low risk across multiple conditions might have an unusually long life expectancy.

Of course, the same applies for high risk: some unlucky individuals have high risk for multiple conditions simultaneously. In fact, there appear to be combinations of SNPs that could make a specific individual an outlier in each of the conditions studied, simultaneously.
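To give a sense of scale, the short Python sketch below computes how common such simultaneous outliers would be under the idealized assumption that the risk scores are fully independent. This is stronger than "largely uncorrelated" and is used here only for illustration; the function name and the choice of the lowest decile and five conditions are likewise hypothetical.

```python
def joint_tail_fraction(tail_fraction, n_conditions):
    """Expected population fraction in the same tail for every condition, if risks are independent."""
    if not 0.0 <= tail_fraction <= 1.0:
        raise ValueError("tail_fraction must lie in [0, 1]")
    return tail_fraction ** n_conditions


# Lowest decile of risk in each of 5 conditions, treated as independent.
frac = joint_tail_fraction(0.10, 5)
one_in = round(1 / frac)  # roughly one person in this many
```

Under these assumptions such individuals are rare but not vanishingly so in a large population; any residual correlation between predictors would change the figure.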

Note

Recently, it was pointed out [66] that the processing of the whole-exome sequencing data via the FE pipeline did not take into account the presence of alternative contigs in the GRCh38 reference genome. This is expected to have led to fewer variants being called in the resulting data set than there should be. Of the 204,829 genomic regions (39.20 Mbp of the human genome) targeted by the whole-exome sequencing process, data from 7,554 regions extending across 1.53 Mbp (about 3.7% of the regions and 3.9% of the targeted sequence) were potentially affected by this error.

In an analysis of this issue [67], Jia et al. compared the number of exome variants per gene identified using whole-exome sequencing data from the UK Biobank with the number identified using data from gnomAD. They found 641 genes for which the UK Biobank exome data contains no variants whatsoever. In contrast, they calculated that, for the vast majority (93%) of these 641 genes, the UK Biobank exome data would be highly likely to contain at least one variant.

To gauge the extent to which our results may have been affected by this discrepancy, we examined the overlap between our lists of top genes ranked by variance accounted for by predictor SNPs (Figures 29–49) and the 641 potentially problematic genes. We found that for most (17 out of 21) conditions, the genes responsible for the top fifteen values of variance accounted for do not include any of the 641 potentially problematic genes.

The exceptions to this are asthma, basal cell carcinoma, type-1 diabetes, and hypothyroidism. To be specific: Asthma has 2 genes (HLA-DQB1 and HLA-DQA1 with variance 2%-3% each) out of its top 18 genes included among the 641 potentially problematic genes. For comparison, asthma has 3% as the highest percentage of predictor variance accounted for by a single gene. Basal cell carcinoma has 1 gene (HLA-DQA1 with variance 2%) out of its top 20 genes included among the 641 potentially problematic genes, while 28% is the highest fraction of predictor variance accounted for by a single gene. Type-1 diabetes has 12 genes (HSPA1L, HLA-DRB1, BTNL2, HLA-DQB1, HLA-DQB2, NEU1, HLA-DOA, HLA-DQA1, HLA-DRA, HSPA1B, C6orf48, LSM2) out of its top 25 genes among the 641 potentially problematic genes. These 12 genes include the top four genes ranked by variance accounted for, with 14%, 9%, 7%, and 4% of predictor variance respectively. Hypothyroidism has 4 genes (HLA-DPA1, HLA-DQB1, HLA-DQA1, and HLA-DPB1, where two have 5% of predictor variance each and two have 1% each) out of its top 17 genes included among the 641 potentially problematic genes, while 8% is the highest value of predictor variance accounted for by a single gene.

Clearly, for all conditions except type-1 diabetes, the 641 potentially problematic genes play very little part in determining the variance accounted for by the predictor SNPs. We therefore expect that the upcoming corrections to the UK Biobank exome data will not qualitatively change our findings for these conditions. It remains possible, of course, that our results for the type-1 diabetes predictor could shift significantly.
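The check described in this note amounts to intersecting each ranked gene list with the set of flagged genes and summing the variance share of the matches. A minimal Python sketch follows; the function name is hypothetical, the flagged set is a four-gene stand-in for the full list of 641 genes, and the ranked list uses placeholder gene names and illustrative percentages rather than values from the tables.

```python
def flag_problematic(top_genes, problematic):
    """Return (flagged genes in rank order, their summed share of predictor variance in percent)."""
    flagged = [(gene, pct) for gene, pct in top_genes if gene in problematic]
    return [gene for gene, _ in flagged], sum(pct for _, pct in flagged)


# Hypothetical ranked list: (gene symbol, percent of predictor variance), largest first.
top_genes = [("GENE_A", 8.0), ("HLA-DQB1", 5.0), ("GENE_B", 2.0), ("HLA-DPA1", 1.0)]
# Small stand-in for the 641 potentially problematic genes.
problematic = {"HLA-DQB1", "HLA-DQA1", "HLA-DPA1", "HLA-DPB1"}

genes, pct = flag_problematic(top_genes, problematic)
```

Reporting the summed percentage alongside the gene names makes it easy to separate conditions where the flagged genes are marginal from a case like type-1 diabetes, where they include the top-ranked genes.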

Competing Interests

Stephen Hsu is a shareholder and serves on the board of directors of Genomic Prediction, Inc. Louis Lello joined the company, becoming an employee and shareholder, during the writing and submission of this paper. The other authors have no commercial interests relevant to the research.

Acknowledgements

SY, TR, LL, and SH acknowledge support from the Office of the Vice-President for Research at MSU. LL was also supported by Genomic Prediction, Inc. during part of this project. This work was supported in part by Michigan State University through computational resources provided by the Institute for Cyber-Enabled Research. This research was conducted using the UK Biobank Resource under UK Biobank Main Application 15326.

Appendix A

Figure 8:

The fifteen largest total values of variance accounted for by predictor SNPs located on a single gene (in terms of the percentage of total variance accounted for by all predictor SNPs) for the asthma predictor. Each vertical bar is colored violet with a depth of shade proportional to the height of the bar. Here, ‘genic’ SNPs are contained within the GENCODE Release 19 gene boundaries plus 30 kilo base pairs at both ends.

Figure 9:

The fifteen largest total values of variance accounted for by predictor SNPs located on a single gene (in terms of the percentage of total variance accounted for by all predictor SNPs) for the atrial fibrillation predictor. Each vertical bar is colored violet with a depth of shade proportional to the height of the bar. Here, ‘genic’ SNPs are contained within the GENCODE Release 19 gene boundaries plus 30 kilo base pairs at both ends.

Figure 10:

The fifteen largest total values of variance accounted for by predictor SNPs located on a single gene (in terms of the percentage of total variance accounted for by all predictor SNPs) for the basal cell carcinoma predictor. Each vertical bar is colored violet with a depth of shade proportional to the height of the bar. Here, ‘genic’ SNPs are contained within the GENCODE Release 19 gene boundaries plus 30 kilo base pairs at both ends.

Figure 11:

The fifteen largest total values of variance accounted for by predictor SNPs located on a single gene (in terms of the percentage of total variance accounted for by all predictor SNPs) for the breast cancer predictor. Each vertical bar is colored violet with a depth of shade proportional to the height of the bar. Here, ‘genic’ SNPs are contained within the GENCODE Release 19 gene boundaries plus 30 kilo base pairs at both ends.

Figure 12:

The fifteen largest total values of variance accounted for by predictor SNPs located on a single gene (in terms of the percentage of total variance accounted for by all predictor SNPs) for the coronary artery disease predictor. Each vertical bar is colored violet with a depth of shade proportional to the height of the bar. Here, ‘genic’ SNPs are contained within the GENCODE Release 19 gene boundaries plus 30 kilo base pairs at both ends.

Figure 13:

The fifteen largest total values of variance accounted for by predictor SNPs located on a single gene (in terms of the percentage of total variance accounted for by all predictor SNPs) for the type-1 diabetes predictor. Each vertical bar is colored violet with a depth of shade proportional to the height of the bar. Here, ‘genic’ SNPs are contained within the GENCODE Release 19 gene boundaries plus 30 kilo base pairs at both ends.

Figure 14:

The fifteen largest total values of variance accounted for by predictor SNPs located on a single gene (in terms of the percentage of total variance accounted for by all predictor SNPs) for the type-2 diabetes predictor. Each vertical bar is colored violet with a depth of shade proportional to the height of the bar. Here, ‘genic’ SNPs are contained within the GENCODE Release 19 gene boundaries plus 30 kilo base pairs at both ends.

Figure 15:

The fifteen largest total values of variance accounted for by predictor SNPs located on a single gene (in terms of the percentage of total variance accounted for by all predictor SNPs) for the diastolic blood pressure predictor. Each vertical bar is colored violet with a depth of shade proportional to the height of the bar. Here, ‘genic’ SNPs are contained within the GENCODE Release 19 gene boundaries plus 30 kilo base pairs at both ends.

Figure 16:

The fifteen largest total values of variance accounted for by predictor SNPs located on a single gene (in terms of the percentage of total variance accounted for by all predictor SNPs) for the education years predictor. Each vertical bar is colored violet with a depth of shade proportional to the height of the bar. Here, ‘genic’ SNPs are contained within the GENCODE Release 19 gene boundaries plus 30 kilo base pairs at both ends.

Figure 17:

The fifteen largest total values of variance accounted for by predictor SNPs located on a single gene (in terms of the percentage of total variance accounted for by all predictor SNPs) for the gallstones predictor. Each vertical bar is colored violet with a depth of shade proportional to the height of the bar. Here, ‘genic’ SNPs are contained within the GENCODE Release 19 gene boundaries plus 30 kilo base pairs at both ends.

Figure 18:

The fifteen largest total values of variance accounted for by predictor SNPs located on a single gene (in terms of the percentage of total variance accounted for by all predictor SNPs) for the glaucoma predictor. Each vertical bar is colored violet with a depth of shade proportional to the height of the bar. Here, ‘genic’ SNPs are contained within the GENCODE Release 19 gene boundaries plus 30 kilo base pairs at both ends.

Figure 19:

The fifteen largest total values of variance accounted for by predictor SNPs located on a single gene (in terms of the percentage of total variance accounted for by all predictor SNPs) for the gout predictor. Each vertical bar is colored violet with a depth of shade proportional to the height of the bar. Here, ‘genic’ SNPs are contained within the GENCODE Release 19 gene boundaries plus 30 kilo base pairs at both ends.

Figure 20:

The fifteen largest total values of variance accounted for by predictor SNPs located on a single gene (in terms of the percentage of total variance accounted for by all predictor SNPs) for the heart attack predictor. Each vertical bar is colored violet with a depth of shade proportional to the height of the bar. Here, ‘genic’ SNPs are contained within the GENCODE Release 19 gene boundaries plus 30 kilo base pairs at both ends.

Figure 21:

The fifteen largest total values of variance accounted for by predictor SNPs located on a single gene (in terms of the percentage of total variance accounted for by all predictor SNPs) for the height predictor. Each vertical bar is colored violet with a depth of shade proportional to the height of the bar. Here, ‘genic’ SNPs are contained within the GENCODE Release 19 gene boundaries plus 30 kilo base pairs at both ends.

Figure 22:

The fifteen largest total values of variance accounted for by predictor SNPs located on a single gene (in terms of the percentage of total variance accounted for by all predictor SNPs) for the high cholesterol predictor. Each vertical bar is colored violet with a depth of shade proportional to the height of the bar. Here, ‘genic’ SNPs are contained within the GENCODE Release 19 gene boundaries plus 30 kilo base pairs at both ends.

Figure 23:

The fifteen largest total values of variance accounted for by predictor SNPs located on a single gene (in terms of the percentage of total variance accounted for by all predictor SNPs) for the hypertension predictor. Each vertical bar is colored violet with a depth of shade proportional to the height of the bar. Here, ‘genic’ SNPs are contained within the GENCODE Release 19 gene boundaries plus 30 kilo base pairs at both ends.

Figure 24:

The fifteen largest total values of variance accounted for by predictor SNPs located on a single gene (in terms of the percentage of total variance accounted for by all predictor SNPs) for the hypothyroidism predictor. Each vertical bar is colored violet with a depth of shade proportional to the height of the bar. Here, ‘genic’ SNPs are contained within the GENCODE Release 19 gene boundaries plus 30 kilo base pairs at both ends.

Figure 25:

The fifteen largest total values of variance accounted for by predictor SNPs located on a single gene (in terms of the percentage of total variance accounted for by all predictor SNPs) for the malignant melanoma predictor. Each vertical bar is colored violet with a depth of shade proportional to the height of the bar. Here, ‘genic’ SNPs are contained within the GENCODE Release 19 gene boundaries plus 30 kilo base pairs at both ends.

Figure 26:

The fifteen largest total values of variance accounted for by predictor SNPs located on a single gene (in terms of the percentage of total variance accounted for by all predictor SNPs) for the menopause predictor. Each vertical bar is colored violet with a depth of shade proportional to the height of the bar. Here, ‘genic’ SNPs are contained within the GENCODE Release 19 gene boundaries plus 30 kilo base pairs at both ends.

Figure 27:

The fifteen largest total values of variance accounted for by predictor SNPs located on a single gene (in terms of the percentage of total variance accounted for by all predictor SNPs) for the pulse rate predictor. Each vertical bar is colored violet with a depth of shade proportional to the height of the bar. Here, ‘genic’ SNPs are contained within the GENCODE Release 19 gene boundaries plus 30 kilo base pairs at both ends.

Figure 28:

The fifteen largest total values of variance accounted for by predictor SNPs located on a single gene (in terms of the percentage of total variance accounted for by all predictor SNPs) for the systolic blood pressure predictor. Each vertical bar is colored violet with a depth of shade proportional to the height of the bar. Here, ‘genic’ SNPs are contained within the GENCODE Release 19 gene boundaries plus 30 kilo base pairs at both ends.

Appendix B

Tables that list the top genes, as ordered by variance accounted for, for various phenotypes.
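As a rough illustration of how such a table could be assembled, the Python sketch below assigns predictor SNPs to genes using gene boundaries padded by 30 kilobase pairs at both ends (the definition of 'genic' given in the captions of Appendix A), sums the variance per gene, and keeps the largest fifteen. The function name, the tuple layouts, the toy coordinates, and the choice to count a SNP for every padded gene it falls in are assumptions made for this sketch, not a description of the authors' code.

```python
from collections import defaultdict

PAD_BP = 30_000  # padding added at both ends of each gene boundary


def top_genes_by_variance(snps, genes, n_top=15, pad=PAD_BP):
    """snps: iterable of (chrom, pos, variance); genes: iterable of (name, chrom, start, end).

    Returns [(gene, variance, percent_of_total_predictor_variance)], largest first.
    A SNP lying inside several padded genes is counted for each of them (an assumption).
    """
    snps = list(snps)
    total = sum(var for _, _, var in snps)
    if total == 0:
        raise ValueError("predictor has zero total variance")
    per_gene = defaultdict(float)
    for name, g_chrom, start, end in genes:
        for chrom, pos, var in snps:
            if chrom == g_chrom and start - pad <= pos <= end + pad:
                per_gene[name] += var
    ranked = sorted(per_gene.items(), key=lambda kv: kv[1], reverse=True)[:n_top]
    return [(gene, var, 100.0 * var / total) for gene, var in ranked]


# Toy data (hypothetical coordinates and variances).
genes = [("GENE_A", "1", 100_000, 120_000), ("GENE_B", "1", 500_000, 510_000)]
snps = [("1", 95_000, 0.2), ("1", 110_000, 0.3), ("1", 505_000, 0.1), ("1", 900_000, 0.4)]
table = top_genes_by_variance(snps, genes)
```

The percentage is taken relative to the variance of all predictor SNPs, including non-genic ones, so the per-gene values need not sum to 100%. The nested loop is quadratic and is meant only to keep the sketch short; an interval index would be the natural choice for real annotation files.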

Figure 29:

For the asthma predictor: list of genes responsible for the top fifteen values of variance accounted for by single genes, and the corresponding variance values, both explicit and expressed as a percentage of the total predictor variance.

Figure 30:

For the atrial fibrillation predictor: list of genes responsible for the top fifteen values of variance accounted for by single genes, and the corresponding variance values, both explicit and expressed as a percentage of the total predictor variance.

Figure 31:

For the basal cell carcinoma predictor: list of genes responsible for the top fifteen values of variance accounted for by single genes, and the corresponding variance values, both explicit and expressed as a percentage of the total predictor variance.

Figure 32:

For the breast cancer predictor: list of genes responsible for the top fifteen values of variance accounted for by single genes, and the corresponding variance values, both explicit and expressed as a percentage of the total predictor variance.

Figure 33:

For the coronary artery disease predictor: list of genes responsible for the top fifteen values of variance accounted for by single genes, and the corresponding variance values, both explicit and expressed as a percentage of the total predictor variance.

Figure 34:

For the type-1 diabetes predictor: list of genes responsible for the top fifteen values of variance accounted for by single genes, and the corresponding variance values, both explicit and expressed as a percentage of the total predictor variance.

Figure 35:

For the type-2 diabetes predictor: list of genes responsible for the top fifteen values of variance accounted for by single genes, and the corresponding variance values, both explicit and expressed as a percentage of the total predictor variance.

Figure 36:

For the diastolic blood pressure predictor: list of genes responsible for the top fifteen values of variance accounted for by single genes, and the corresponding variance values, both explicit and expressed as a percentage of the total predictor variance.

Figure 37:

For the education years predictor: list of genes responsible for the top fifteen values of variance accounted for by single genes, and the corresponding variance values, both explicit and expressed as a percentage of the total predictor variance.

Figure 38:

For the gallstones predictor: list of genes responsible for the top fifteen values of variance accounted for by single genes, and the corresponding variance values, both explicit and expressed as a percentage of the total predictor variance.

Figure 39:

For the glaucoma predictor: list of genes responsible for the top fifteen values of variance accounted for by single genes, and the corresponding variance values, both explicit and expressed as a percentage of the total predictor variance.

Figure 40:

For the gout predictor: list of genes responsible for the top fifteen values of variance accounted for by single genes, and the corresponding variance values, both explicit and expressed as a percentage of the total predictor variance.

Figure 41:

For the heart attack predictor: list of genes responsible for the top fifteen values of variance accounted for by single genes, and the corresponding variance values, both explicit and expressed as a percentage of the total predictor variance.

Figure 42:

For the height predictor: list of genes responsible for the top fifteen values of variance accounted for by single genes, and the corresponding variance values, both explicit and expressed as a percentage of the total predictor variance.

Figure 43:

For the high cholesterol predictor: list of genes responsible for the top fifteen values of variance accounted for by single genes, and the corresponding variance values, both explicit and expressed as a percentage of the total predictor variance.

Figure 44:

For the hypertension predictor: list of genes responsible for the top fifteen values of variance accounted for by single genes, and the corresponding variance values, both explicit and expressed as a percentage of the total predictor variance.

Figure 45:

For the hypothyroidism predictor: list of genes responsible for the top fifteen values of variance accounted for by single genes, and the corresponding variance values, both explicit and expressed as a percentage of the total predictor variance.

Figure 46:

For the malignant melanoma predictor: list of genes responsible for the top fifteen values of variance accounted for by single genes, and the corresponding variance values, both explicit and expressed as a percentage of the total predictor variance.

Figure 47:

For the menopause predictor: list of genes responsible for the top fifteen values of variance accounted for by single genes, and the corresponding variance values, both explicit and expressed as a percentage of the total predictor variance.

Figure 48:

For the pulse rate predictor: list of genes responsible for the top fifteen values of variance accounted for by single genes, and the corresponding variance values, both explicit and expressed as a percentage of the total predictor variance.

Figure 49:

For the systolic blood pressure predictor: list of genes responsible for the top fifteen values of variance accounted for by single genes, and the corresponding variance values, both explicit and expressed as a percentage of the total predictor variance.

Appendix C: UK Biobank Data

C.1 Array Sequencing Data

Predictors (with the exception of the CAD predictor from [7]) were originally trained using the 2018 release of the UK Biobank. Predictor training was restricted to genetically British individuals (as defined using ancestry principal component analysis performed by UK Biobank) [35, 36]. In 2018, the UK Biobank re-released the dataset representing approximately 500,000 individuals genotyped on two Affymetrix platforms: approximately 50,000 samples on the UKB BiLEVE Axiom array and the remainder on the UKB Biobank Axiom array. Genotype information was collected for 488,377 individuals and 805,426 SNPs, which were subsequently imputed to a much larger number of SNPs. More details about the design of the array can be found in two documents on the UK Biobank website: the Axiom array content summary (http://www.ukbiobank.ac.uk/wp-content/uploads/2014/04/UK-Biobank-Axiom-Array-Content-Summary-2014-1.pdf) and the Axiom array datasheet (http://www.ukbiobank.ac.uk/wp-content/uploads/2014/04/UK-Biobank-Axiom-Array-Datasheet-2014-1.pdf). Further information about the genotyping and phenotyping used to build the original predictors can be found in [5, 7].

C.2 Exome Sequencing Data

In March 2019, the UK Biobank released whole-exome sequencing (WES) data for 49,960 participants [60]. Selection of participants for the study prioritized individuals with whole-body MRI imaging data from the UK Biobank Imaging Study, enhanced baseline measurements, hospital episode statistics (HES), linked primary care records, and admission to hospital with a primary diagnosis of asthma. With regard to age, sex and ancestry, the sequenced individuals are representative of the overall UK Biobank cohort. The sample set has 194 parent-offspring pairs, including 26 mother-father-child trios, 613 full-sibling pairs, 1 monozygotic twin pair and 195 second-degree genetically determined relationships.

Exomes were captured using a version of the IDT xGen Exome Research Panel v1.0. Multiplexed samples were sequenced with dual-indexed 75 × 75 bp paired-end reads on the Illumina NovaSeq 6000 platform using S2 flow cells. The specific genomic regions targeted for sequencing covered 39 megabases of the human genome, corresponding to 19,396 genes. In addition, the regions measuring 100 bp and located directly upstream and downstream of each target region were also sequenced.

A total of 4,735,722 variants located in targeted regions were identified. With adjacent (non-targeted) 100 bp regions included in the tally, a total of 9,693,526 indel and single nucleotide variants (SNVs) were observed. While only the target regions are required to meet all sequencing quality standards such as unique read coverage, variants in both target and adjacent regions were subjected to the same variant quality control metrics. Approximately 14% of coding variants identified via whole-exome sequencing were observed in the imputed sequence of 49,797 participants with both whole-exome sequencing and imputed data. 22.6% of the coding variants in the imputed data were not observed in the whole-exome sequencing data.

Footnotes

  • † rabentim@msu.edu

  • ‡ lellolou@msu.edu

  • § hsu@msu.edu

References

  1. Vattikuti, S., Lee, J. J., Chang, C. C., Hsu, S. D. & Chow, C. C. Applying compressed sensing to genome-wide association studies. GigaScience 3, 10 (2014).
  2. Ho, C. M. & Hsu, S. D. Determination of nonlinear genetic architecture using compressed sensing. GigaScience 4, 44 (2015).
  3. Yang, J., Lee, S. H., Goddard, M. E. & Visscher, P. M. GCTA: a tool for genome-wide complex trait analysis. The American Journal of Human Genetics 88, 76–82 (2011).
  4. Vilhjálmsson, B. J. et al. Modeling linkage disequilibrium increases accuracy of polygenic risk scores. The American Journal of Human Genetics 97, 576–592 (2015).
  5. Lello, L., Raben, T. G., Yong, S. Y., Tellier, L. C. & Hsu, S. D. H. Genomic prediction of 16 complex disease risks including heart attack, diabetes, breast and prostate cancer. Sci Rep 9, 2019 (2019).
  6. Lello, L. et al. Accurate genomic prediction of human height. Genetics 210, 477–497 (2018).
  7. Khera, A. V. et al. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nature Genetics 50, 1219 (2018).
  8. Marigorta, U. M., Rodriguez, J. A., Gibson, G. & Navarro, A. Replicability and Prediction: Lessons and Challenges from GWAS. Trends in Genetics 34, 504–517 (2018).
  9. Tam, V. et al. Benefits and limitations of genome-wide association studies. Nature Reviews Genetics 20, 467–484 (2019).
  10. Chatterjee, N., Shi, J. & García-Closas, M. Developing and evaluating polygenic risk prediction models for stratified disease prevention. Nature Reviews Genetics 17, 392–406 (2016).
  11. Euesden, J., Lewis, C. M. & O’Reilly, P. F. PRSice: Polygenic Risk Score software. Bioinformatics 31, 1466–1468 (2015).
  12. Torkamani, A., Wineinger, N. E. & Topol, E. J. The personal and clinical utility of polygenic risk scores. Nature Reviews Genetics 19, 581–590 (2018).
  13. Shieh, Y. et al. Breast cancer risk prediction using a clinical risk model and polygenic risk score. Breast Cancer Research and Treatment 159, 513–525 (2016).
  14. Lewis, C. M. & Vassos, E. in Genome Med (9(96), 2017).
  15. Abraham, G. & Inouye, M. Genomic risk prediction of complex human disease and its clinical application. Current Opinion in Genetics & Development 33, 10–16 (2015).
  16. Priest, J. R. & Ashley, E. A. Genomics in clinical practice. BMJ Heart 100, 1569–1570 (2014).
  17. Jacob, H. J. et al. Genomics in clinical practice: lessons from the front lines. Science Translational Medicine 5. American Association for the Advancement of Science (2013).
  18. Veenstra, D. L., Roth, J. A., Garrison, L. P., Ramsey, S. D. & Burke, W. A formal risk-benefit framework for genomic tests: facilitating the appropriate translation of genomics into clinical practice. Genetics in Medicine 12. Nature Publishing Group, 686–693 (2010).
  19. Bowdin, S. et al. Recommendations for the integration of genomics into clinical practice. Genetics in Medicine 18, 1075–1084 (2016).
  20. Francisco, M & Bustamante, C. D. Polygenic risk scores: a biased prediction? Genome Medicine 10. BioMed Central, 1–3 (2018).
  21. Martin, A. R. et al. Clinical use of current polygenic risk scores may exacerbate health disparities. Nature Genetics 51, 584–591 (2019).
  22. Nelson, H. D., Pappas, M., Cantor, A., Haney, E. & Holmes, R. Risk assessment, genetic counseling, and genetic testing for BRCA-related cancer in women: updated evidence report and systematic review for the US Preventive Services Task Force. JAMA 322, 666–685 (2019).
  23. Amir, E., Freedman, O. C., Seruga, B. & Evans, D. G. Assessing women at high risk of breast cancer: a review of risk assessment models. JNCI: Journal of the National Cancer Institute 102. Oxford University Press, 680–691 (2010).
  24. 24.↵
    Offit, K. BRCA Mutation Frequency and Penetrance: New Data, Old Debate. JNCI: Journal of the National Cancer Institute 98, 23 (2006) (cit. on p. 2).
    OpenUrl
  25. 25.
    Ford, D., Easton, D. F. & Peto, J. Estimates of the gene frequency of BRCA1 and its contribution to breast and ovarian cancer incidence. American Journal of Human Genetics 57, 1457–62 (1995) (cit. on p. 2).
    OpenUrlPubMedWeb of Science
  26. 26.↵
    Whittemore, A. S. et al. Prevalence of BRCA1 mutation carriers among U.S. non-Hispanic Whites. Cancer Epidemoiol. Biomarkers Prev. 13, 2078–83 (2004) (cit. on p. 2).
    OpenUrl
  27. 27.↵
    Kuchenbaecker, K. et al. Evaluation of Polygenic Risk Scores for Breast and Ovarian Cancer Risk Prediction in BRCA1 and BRCA2 Mutation Carriers. JNCI: Journal of the National Cancer Institute 109, 7 (2017) (cit. on p. 2).
    OpenUrl
  28. 28.↵
    Mavaddat, N. et al. Polygenic risk scores for prediction of breast cancer and breast cancer subtypes. The American Journal of Human Genetics 104. Elsevier, 21–34 (2019) (cit. on p. 2).
  29. 29.↵
    Kakushadze, Z., Raghubanshi, R. & Yu, W. Estimating cost savings from early cancer diagnosis. Data 2, 30 (2017) (cit. on p. 2).
    OpenUrl
  30. 30.↵
    Cohen, L. E. Idiopathic short stature: a clinical review. Jama 311, 1787–1796 (2014) (cit. on p. 2).
    OpenUrlCrossRefPubMed
  31. 31.↵
    Bryant, J., Baxter, L., Cave, C. B. & Milne, R. Recombinant growth hormone for idiopathic short stature in children and adolescents. Cochrane Database of Systematic Reviews 3 (2007) (cit. on p. 2).
  32. 32.↵
    Finkelstein, B. S. et al. Effect of growth hormone therapy on height in children with idiopathic short stature: a meta-analysis. Archives of pediatrics & adolescent medicine 156, 230–240 (2002) (cit. on p. 2).
  33.
    Cohen, P. et al. ISS Consensus Workshop participants, 2008. Consensus statement on the diagnosis and treatment of children with idiopathic short stature: a summary of the Growth Hormone Research Society, the Lawson Wilkins Pediatric Endocrine Society, and the European Society for Paediatric Endocrinology Workshop. The Journal of Clinical Endocrinology & Metabolism 93, 4210–4217 (2007) (cit. on p. 2).
  34.
    Wit, J. M. et al. Idiopathic short stature: definition, epidemiology, and diagnostic evaluation. Growth Hormone & IGF Research 18, 89–110 (2008) (cit. on p. 2).
  35.
    Sudlow, C. et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS medicine 12, 3 (2015) (cit. on pp. 2, 35).
  36.
    Bycroft, C., Freeman, C. & Petkova, D. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018) (cit. on pp. 2, 35).
  37.
    Azodi, C. B. et al. Benchmarking Parametric and Machine Learning Models for Genomic Prediction of Complex Traits. G3: Genes, Genomes, Genetics 9, 3691–3702 (2019) (cit. on p. 2).
  38.
    Cunningham, F. et al. Ensembl 2019. Nucleic acids research 47, D745–D751 (2018) (cit. on p. 2).
  39.
    Harrow, J. et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome research 22, 1760–1774 (2012) (cit. on p. 3).
  40.
    Church, D. M. et al. Modernizing reference genome assemblies. PLoS biology 9, p. e1001091 (2011) (cit. on p. 3).
  41.
    Gerstein, M. B. et al. What is a gene, post-ENCODE? History and updated definition. Genome research 17, 669–681 (2007) (cit. on p. 4).
  42.
    Gingeras, T. R. Origin of phenotypes: genes and transcripts. Genome research 17, 682–690 (2007) (cit. on p. 4).
  43.
    Portin, P. & Wilkins, A. The evolving definition of the term “gene”. Genetics 205, 1353–1364 (2017) (cit. on p. 4).
  44.
    Stacey, S. N. et al. New basal cell carcinoma susceptibility loci. Nature communications 6, p. 6825 (2015) (cit. on p. 6).
  45.
    Stacey, S. N. et al. Germline sequence variants in TGM3 and RGS22 confer risk of basal cell carcinoma. Human molecular genetics 23, 3045–3053 (2014) (cit. on p. 6).
  46.
    Hunter, D. J. et al. A genome-wide association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer. Nature genetics 39, p. 870 (2007) (cit. on p. 6).
  47.
    Easton, D. F. et al. Genome-wide association study identifies novel breast cancer susceptibility loci. Nature 447, p. 1087 (2007) (cit. on p. 6).
  48.
    Grant, S. F. et al. Variant of transcription factor 7-like 2 (TCF7L2) gene confers risk of type 2 diabetes. Nature genetics 38, p. 320 (2006) (cit. on p. 6).
  49.
    Buch, S. et al. A genome-wide association scan identifies the hepatic cholesterol transporter ABCG8 as a susceptibility factor for human gallstone disease. Nature genetics 39, p. 995 (2007) (cit. on p. 6).
  50.
    Jiang, Z. Y. et al. Increased expression of LXRα, ABCG5, ABCG8, and SR-BI in the liver from normolipidemic, nonobese Chinese gallstone patients. Journal of lipid research 49, 464–472 (2008) (cit. on p. 6).
  51.
    Burdon, K. P. et al. Genome-wide association study identifies susceptibility loci for open angle glaucoma at TMCO1 and CDKN2B-AS1. Nature genetics 43, p. 574 (2011) (cit. on p. 6).
  52.
    Woodward, O. M. et al. Identification of a urate transporter, ABCG2, with a common functional polymorphism causing gout. Proceedings of the National Academy of Sciences 106, 10338–10342 (2009) (cit. on p. 6).
  53.
    Matsuo, H. et al. Common defects of ABCG2, a high-capacity urate exporter, cause gout: a function-based genetic analysis in a Japanese population. Science translational medicine 1, 5–11 (2009) (cit. on p. 6).
  54.
    Vitart, V. et al. SLC2A9 is a newly identified urate transporter influencing serum urate concentration, urate excretion and gout. Nature genetics 40, p. 437 (2008) (cit. on p. 6).
  55.
    Trégouët, D. A. et al. Genome-wide haplotype association study identifies the SLC22A3-LPAL2-LPA gene cluster as a risk locus for coronary artery disease. Nature genetics 41, p. 283 (2009) (cit. on p. 6).
  56.
    Valverde, P. et al. The Asp84Glu variant of the melanocortin 1 receptor (MC1R) is associated with melanoma. Human molecular genetics 5, 1663–1666 (1996) (cit. on p. 6).
  57.
    Kennedy, C. et al. Melanocortin 1 receptor (MC1R) gene variants are associated with an increased risk for cutaneous melanoma which is largely independent of skin type and hair color. Journal of Investigative Dermatology 117, 294–300 (2001) (cit. on p. 6).
  58.
    https://www.illumina.com/techniques/sequencing/dna-sequencing/targeted-resequencing/exome-sequencing.html (cit. on p. 6).
  59.
    Van Dijk, E. L., Auger, H., Jaszczyszyn, Y. & Thermes, C. Ten years of next-generation sequencing technology. Trends in genetics 30, 418–426 (2014) (cit. on p. 6).
  60.
    Van Hout, C. V. et al. Whole exome sequencing and characterization of coding variation in 49,960 individuals in the UK Biobank. bioRxiv (2019) (cit. on pp. 6, 35).
  61.
    Regier, A. A. et al. Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects. Nature communications 9, p. 4038 (2018) (cit. on p. 6).
  62.
    Abecasis, G. R. et al. Extent and distribution of linkage disequilibrium in three genomic regions. The American Journal of Human Genetics 68, 191–197 (2001) (cit. on p. 9).
  63.
    Hackinger, S. Pleiotropy in complex traits (Diss. University of Cambridge, 2019) (cit. on p. 14).
  64.
    Hackinger, S. & Zeggini, E. Statistical methods to detect pleiotropy in human complex traits. Open Biology 7, 170125 (2017) (cit. on p. 14).
  65.
    Socrates, A. et al. Polygenic risk scores applied to a single cohort reveal pleiotropy among hundreds of human phenotypes. bioRxiv 203257 (2017) (cit. on p. 14).
  66.
    https://www.ukbiobank.ac.uk/wp-content/uploads/2019/12/UK-Biobank-50k-Exome-Release-FAQ-December-2019.pdf (cit. on p. 14).
  67.
    Jia, T., Munson, B., Allen, H. L., Ideker, T. & Majithia, A. R. Thousands of missing variants in the UK BioBank are recoverable by genome realignment. bioRxiv (2019) (cit. on p. 14).
Posted February 13, 2020.

Subject Area

  • Genomics