Abstract
Background There is now convincing evidence that pleiotropy across the genome contributes to the correlation between human traits and comorbidity of diseases. The recent availability of genome-wide association study (GWAS) results has made the polygenic risk score (PRS) approach a powerful way to perform genetic prediction and identify genetic overlap among phenotypes.
Methods and findings Here we use the PRS method to assess evidence for shared genetic aetiology across hundreds of traits within a single epidemiological study – the Northern Finland Birth Cohort 1966 (NFBC1966). We replicate numerous recent findings, such as a genetic association between Alzheimer’s disease and lipid levels, while the depth of phenotyping in the NFBC1966 highlights a range of novel significant genetic associations between traits.
Conclusions This study illustrates the power in taking a hypothesis-free approach to the study of shared genetic aetiology between human traits and diseases. It also demonstrates the potential of the PRS method to provide important biological insights using only a single well-phenotyped epidemiological study of moderate sample size (~5k), with important advantages over evaluating genetic correlations from GWAS summary statistics only.
Introduction
The emergence of large-scale GWAS results has demonstrated an enrichment of genetic variants affecting multiple phenotypes, confirming that pleiotropy is a common feature of the human genome [1,2]. Several statistical genetics methods have been developed to quantify this shared genetic architecture formally, such as bivariate genome-wide complex trait analysis (GCTA), LD Score regression and the polygenic risk score (PRS) approach [3–5]. Applying these methods, a large number of studies have tested shared genetic aetiology between two traits, and more recently these have been expanded to estimate pairwise genetic overlap across multiple traits [6–8].
The PRS approach utilises GWAS summary statistics to produce individual-level risk or profile scores and is, therefore, the technique that offers most hope for future personalised or precision medicine [9,10]. Single nucleotide polymorphisms (SNPs) associated with a phenotype below a specific P-value threshold are used to produce a score that predicts the risk of that clinical outcome or infers its trait values. Individual polygenic scores can then be used to predict other traits in a regression across a study sample to expose genetic overlap between traits. The key benefits of using the PRS method over alternatives relate to modelling flexibility and statistical power. Exploiting individual-level cohort data allows a greater number of phenotypes and models to be tested [11], relative to relying on GWAS summary statistics only. This can enable greater exposure of causal pathways than via application of LD Score Regression [6]. Moreover, GCTA requires large-scale (N > 10k) individual-level data on all tested phenotypes for sufficient power, while LD Score Regression requires large-scale GWAS summary statistics on all phenotypes, yet the PRS method is well-powered from using only large-scale GWAS data for the base phenotypes and relatively small-scale (N < 10k) individual-level data on the target traits. A potential limitation of PRS analyses is that subject overlap between the GWAS samples and the target cohort samples can lead to false positive associations. However, this can be addressed directly by recalculating the base GWAS with overlapping cohorts removed, or mathematically by using their GWAS results in a recalculation of the meta-analysis GWAS excluding their effect (see Methods).
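The scoring step described above can be illustrated with a minimal sketch. It assumes the common additive form of a PRS, in which effect-allele counts are weighted by base-GWAS effect sizes and only SNPs below the chosen P-value threshold are included. The function name `polygenic_score` and the toy values are hypothetical and are not taken from the study or from any PRS software.

```python
from typing import Sequence


def polygenic_score(
    dosages: Sequence[float],
    betas: Sequence[float],
    pvalues: Sequence[float],
    p_threshold: float,
) -> float:
    """Sum effect-allele dosages weighted by base-GWAS effect sizes,
    keeping only SNPs whose base-GWAS P-value is below p_threshold."""
    if not (len(dosages) == len(betas) == len(pvalues)):
        raise ValueError("dosages, betas and pvalues must have equal length")
    return sum(
        d * b for d, b, p in zip(dosages, betas, pvalues) if p < p_threshold
    )


# Hypothetical toy data: three SNPs for a single individual.
dosages = [2, 1, 0]            # copies of the effect allele (0, 1 or 2)
betas = [0.10, -0.05, 0.20]    # per-allele effect sizes from the base GWAS
pvalues = [1e-8, 0.03, 0.40]   # base-GWAS association P-values

# A strict threshold keeps only the first SNP; a liberal one keeps two.
score_strict = polygenic_score(dosages, betas, pvalues, p_threshold=5e-8)
score_liberal = polygenic_score(dosages, betas, pvalues, p_threshold=0.05)
```

Scores computed this way for every individual in the target sample would then serve as the predictor in a regression on each target trait.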
For the current analysis, PRS were generated in the NFBC1966 participants in relation to 48 phenotypes with large-scale GWAS results available. These PRS were used to predict over 100 traits in the NFBC1966 cohort data. The NFBC1966 contains detailed information on clinical outcomes and health-related behaviour, offering an opportunity to test many traits and combinations not previously investigated. Hence, the depth of phenotyping in the NFBC1966 offers a unique possibility to shed further light on the genetic overlap among phenotypes.
Methods
5404 participants with genotype data on 364,590 SNPs were available in the NFBC1966 [12]. Baseline data were collected on maternal and offspring demographic, clinical and anthropometric traits in early life. Follow-up data were collected at 14 and 31 years of age on a range of traits, including blood pressure, BMI, cardiovascular fitness, atopy, asthma, infections and lifestyle traits. Blood samples for DNA, lipids, glucose, insulin and hormones were also taken. Polygenic risk scores were calculated for each of the 5404 individuals using their genome-wide data and 48 publicly available GWAS (see S1 Table). These were tested for their association with 143 traits measured in the NFBC1966 (see S2 Table). Subject overlap between the GWAS summary statistics and the NFBC1966 would cause inflation of the results, so the contribution of the NFBC1966 to the discovery GWAS was removed either: (1) directly, by reanalyzing the discovery GWAS with the NFBC1966 data excluded, or, if this was not an option, (2) by recalculating what the meta-analysis GWAS results would have been had the NFBC1966 data been removed, based on the effect size estimate and corresponding standard error for each SNP from a GWAS performed on the NFBC1966 only (conducted in the same way as the meta-analysis GWAS). The latter was enabled via the inverse-variance meta-analysis formulae (this recalculation is approximate if the less commonly used sample-size weighted meta-analysis method was used in the discovery GWAS) as follows:

$$\beta_{ALL} = \frac{\sum_{i=1}^{n} \omega_i \beta_i}{\sum_{i=1}^{n} \omega_i}, \qquad SE_{ALL} = \sqrt{\frac{1}{\sum_{i=1}^{n} \omega_i}} \tag{1}$$

where $\beta_{ALL}$ is the effect size estimate meta-analysed across $n$ cohorts, and the $\beta_i$ for each cohort $i$ is weighted by the inverse of its squared standard error, such that its weight, $\omega_i$, equals $1/SE_i^2$.

So, to find the effect size estimate, $\beta_{ADJ}$, and its standard error, $SE_{ADJ}$, with cohort $k$ removed, by definition:

$$\beta_{ADJ} = \frac{\sum_{i \neq k} \omega_i \beta_i}{\sum_{i \neq k} \omega_i}, \qquad SE_{ADJ} = \sqrt{\frac{1}{\sum_{i \neq k} \omega_i}} \tag{2}$$

Thus:

$$\beta_{ADJ} = \frac{\beta_{ALL} \sum_{i=1}^{n} \omega_i - \omega_k \beta_k}{\sum_{i=1}^{n} \omega_i - \omega_k} = \frac{\beta_{ALL}/SE_{ALL}^2 - \beta_k/SE_k^2}{1/SE_{ALL}^2 - 1/SE_k^2}$$

and by definition and (2):

$$SE_{ADJ} = \sqrt{\frac{1}{1/SE_{ALL}^2 - 1/SE_k^2}}$$
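The recalculation described above, removing one cohort's contribution from an inverse-variance meta-analysis, can be sketched per SNP as follows. The sketch assumes the standard fixed-effect weights (one over the squared standard error); the function names `inverse_variance_meta` and `remove_cohort` and the toy numbers are hypothetical and do not come from the study.

```python
import math
from typing import Sequence, Tuple


def inverse_variance_meta(
    betas: Sequence[float], ses: Sequence[float]
) -> Tuple[float, float]:
    """Fixed-effect inverse-variance meta-analysis for one SNP."""
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta_all = sum(w * b for w, b in zip(weights, betas)) / total
    return beta_all, math.sqrt(1.0 / total)


def remove_cohort(
    beta_all: float, se_all: float, beta_k: float, se_k: float
) -> Tuple[float, float]:
    """Recover the estimate and standard error with cohort k excluded."""
    w_all = 1.0 / se_all ** 2   # sum of all cohort weights
    w_k = 1.0 / se_k ** 2       # weight of the cohort being removed
    w_adj = w_all - w_k
    if w_adj <= 0:
        raise ValueError("cohort k cannot carry all of the meta-analysis weight")
    beta_adj = (beta_all * w_all - beta_k * w_k) / w_adj
    return beta_adj, math.sqrt(1.0 / w_adj)


# Hypothetical per-cohort estimates for one SNP; the third cohort is removed.
cohort_betas = [0.10, 0.20, 0.05]
cohort_ses = [0.05, 0.10, 0.08]

b_all, se_all = inverse_variance_meta(cohort_betas, cohort_ses)
b_adj, se_adj = remove_cohort(b_all, se_all, cohort_betas[2], cohort_ses[2])

# Reference: meta-analysing only the two remaining cohorts directly.
b_ref, se_ref = inverse_variance_meta(cohort_betas[:2], cohort_ses[:2])
```

In this toy case the adjusted values agree with the direct two-cohort meta-analysis, which is the property the recalculation relies on when the discovery GWAS used inverse-variance weighting.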
The polygenic risk score software PRSice [13] was used for the data analysis, which involved performing linear (continuous traits) or logistic (binary traits) regression of NFBC1966 target phenotypes on PRS, with PRS computed in relation to each of the 48 GWAS summary statistic results, to test for their association. The SNPs in the base (discovery GWAS) had ambiguous SNP genotype calls removed and strands flipped where necessary. SNPs in linkage disequilibrium (LD) were “clumped” using a threshold of r² < 0.1 across 250 kb windows to ensure that those analysed are largely independent [13]. Ancestry informative covariates were generated from the target genotype data using Principal Component Analysis (PCA), and the first 10 PCs were included in the regressions to control for population stratification. Further analyses also controlled for sex, socio-economic status and BMI to potentially increase power or expose mediation effects (e.g. see S3 Fig, S4 Fig, S5 Fig, S6 Fig and S7 Fig). High-resolution scoring was performed in PRSice to identify the most predictive PRS for each trait from the large number of PRS that can be formed by inclusion of groups of SNPs with different GWAS association P-value thresholds [13]. While this makes the P-value for association between the PRS and target phenotypes over-fit, we apply a significance threshold of P = 0.004 for each association test based on a permutation study for high-resolution scoring by Euesden et al. [13]. Bonferroni correction for the large number of genetic-phenotype tests performed (6864) produces a conservative significance threshold of P < 5×10⁻⁷ (0.004/6864), with thresholds of P < 0.001 and P < 0.01 used to indicate potential associations in the data.
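The multiple-testing arithmetic above can be checked with a few lines. The variable names are illustrative only; the inputs (48 base GWAS, 143 target traits, a per-test threshold of 0.004) are the figures stated in the text.

```python
# Number of PRS-trait association tests: every base GWAS against every trait.
n_base_gwas = 48
n_target_traits = 143
n_tests = n_base_gwas * n_target_traits

# Per-test threshold that accounts for high-resolution scoring, as stated
# in the text, divided by the number of tests (Bonferroni correction).
per_test_threshold = 0.004
bonferroni_threshold = per_test_threshold / n_tests

print(n_tests)                         # number of tests
print(f"{bonferroni_threshold:.2e}")   # corrected threshold
```

The quotient is approximately 5.8×10⁻⁷, so the reported threshold of P < 5×10⁻⁷ is rounded down and is therefore slightly more conservative than the exact value.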
Traits pertaining to socio-economic status and exercise measures were originally coded with the highest number on the questionnaire pertaining to the lowest measure of each trait (e.g. “5” for the lowest ability for “running 5km”; see S3 Table), so these were recoded in the opposite direction to provide greater clarity in the results.
PRS-sex interactions were also tested for the 162 genetic-phenotype associations that exceeded the significance threshold (P < 5×10⁻⁷), using linear models in which the corresponding outcome trait was regressed on PRS, sex, 10 PCs and a PRS*sex interaction term. Bonferroni correction for the number of interactions tested produces a conservative significance threshold of P < 3×10⁻⁴ (0.05/162).
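The structure of the interaction model can be sketched with ordinary least squares on toy data. This is a simplified illustration: it omits the 10 PCs, estimates coefficients only (no standard errors or P-values), and uses a small hand-written solver so that it needs no external libraries. The function names and data are hypothetical and are not taken from the study.

```python
from typing import List, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_interaction_model(
    trait: Sequence[float], prs: Sequence[float], sex: Sequence[int]
) -> List[float]:
    """Least-squares fit of trait ~ intercept + PRS + sex + PRS*sex.
    Returns coefficients in that order via the normal equations."""
    rows = [[1.0, p, float(s), p * s] for p, s in zip(prs, sex)]
    k = 4
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, trait)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical noise-free data with a known PRS*sex interaction of 0.3.
prs = [-1.0, 0.0, 1.0, 2.0, -1.0, 0.0, 1.0, 2.0]
sex = [0, 0, 0, 0, 1, 1, 1, 1]
trait = [1.0 + 0.5 * p + 0.2 * s + 0.3 * p * s for p, s in zip(prs, sex)]

coefs = fit_interaction_model(trait, prs, sex)
interaction_effect = coefs[3]  # coefficient on the PRS*sex term
```

The last coefficient measures how the PRS effect differs between the sexes; in a full analysis its P-value, not computed here, would be compared against the corrected threshold.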
Results
PRS were calculated for 5404 participants with genotype data available in NFBC1966 using publicly available GWAS summary statistics (see S1 Table). The PRS across the sample were tested for associations with phenotypes from data collected both in early life and at 31 years in the NFBC1966 participants. Data included anthropometric measurements, blood measurements (e.g. cardiometabolic risk factors), hormone levels, and questionnaire data at baseline and 31 years, socio-economic factors, medical history and health related behaviours. Details of all phenotypes are provided in S3 Table. Altogether, PRS were computed using 48 GWAS meta-analysis summary statistics and were tested for their association with 143 NFBC1966 phenotypes, corresponding to 6864 tests in total.
We grouped the results into five categories of related target phenotypes: medical conditions (Fig 1), metabolic traits (Fig 2), lifestyle and social factors (Fig 3), health (S1 Fig), and anthropometrics (S2 Fig). We present the results corresponding to each category as heat maps depicting the associations between each PRS and each target phenotype, with significant associations highlighted by asterisks.
Given the unusually large number of results that an analysis of this kind produces, and its relative novelty, we recommend considering the following points while inspecting the results: (i) the statistical power is a function of the sample sizes of both the discovery GWAS and target data, and while we have removed those that are highly underpowered there remains high variation in power among the results, (ii) the observation of genetic associations limits the opportunity for confounding to produce spurious associations but the results here do not distinguish between two main plausible explanations for associations: horizontal pleiotropy (genetic effect is on the two traits directly) and vertical pleiotropy (genetic effect is on one trait, which has a downstream effect on the second trait) [2] and the order of any causation is not necessarily from discovery to target trait (see Discussion), (iii) the same basic adjustment for covariates (see Methods) is performed across all tests, so results may change qualitatively with adjustment of risk factors particularly relevant to the target trait under study, (iv) given the lack of mechanistic insight and replication in these results, they should be viewed more as hypothesis-generating than confirmatory. Our hope is that particular results will motivate and guide follow-up investigations by researchers with expertise in the corresponding phenotypes. The results that we highlight and summarise below reflect only those that we consider some of the more interesting results and are necessarily only a subset of the potentially important results.
Fig 1 shows associations between the 48 PRS and 46 NFBC1966 medical conditions that were either self-estimated by the participant or self-reported as being verified by a physician; these include metabolic disorders, psychiatric disorders, infectious diseases and allergies. None of the associations in this category were significant after applying the stringent Bonferroni correction for multiple testing, which may reflect the small number of cases of many of the conditions in the sample. However, there were a number of results with suggestive evidence (P < 0.001) that confirm expectations or that may be of interest for follow-up studies. For example, the associations between schizophrenia PRS and depression, schizophrenia and psychosis are as expected, as is that between Parkinson’s disease PRS and self-reported neurological disease [14–16]. The positive association between heart disease PRS and migraine is supported by epidemiological research [17,18], but is in contrast with previous genetic studies suggesting that the genetic component for migraine with aura is protective for heart disease [19]. Moreover, the PRS for HDL cholesterol, a proposed protective factor for cardiovascular diseases [20], was negatively associated with gallstones [21], follicle stimulating hormone with self-reported cancer [22], and thyroid stimulating hormone with asthma [23]. The cannabis smoking PRS was positively associated with asthma, inborn heart disease and cardiac insufficiency [24,25]. There are also several suggestive associations relating to psychological trait PRS. Positive affect PRS is associated with reduced eczema [26], extraversion PRS is positively associated with chlamydia, supporting literature linking extraversion to high risk sexual behaviour [27], the openness PRS is positively associated with mental health problems [28–30], while the PRS on environmental sensitivity is negatively correlated with angina and cancer [31,32].
Fig 2 depicts the associations between the 48 PRS and 19 NFBC1966 cardiometabolic traits. There are a large number of significant results here, including many that would be expected based on the epidemiological literature. Some associations have been observed in the genetic literature previously, such as between BMI PRS and the major lipids (LDL, HDL, cholesterol) and C-reactive protein [33], while others are novel, such as between the lipids and testosterone, sex hormone binding globulin and insulin [34]. A genetic overlap between Alzheimer’s disease and plasma lipids HDL, LDL and triglycerides was replicated [35], but there were additional associations with total cholesterol and VLDL here. VLDL also shows suggestive evidence of associations with the PRS of childhood IQ, ADHD and cigarette smoking, indicating a potentially greater role for this lipoprotein than previously thought [38,39]. The associations between cardiovascular disease and diabetes PRS and the lipids are as expected from the epidemiological literature [36,37], while the suggestive evidence for associations between anorexia and neuroticism PRS and insulin supports the proposed role for genetics in the shared aetiology between insulin and cognitive function [40].
Fig 3 shows the associations between the 48 PRS and 38 NFBC1966 lifestyle and social factors, mostly comprising occupation, smoking and alcohol consumption measures. The results pertaining to education highlight the potential for mediation by lifestyle to produce genetic pleiotropy; for example, the college and ‘years in education’ PRS are negatively associated with beer/cider and wine amount but positively associated with wine frequency, which may reflect the adoption of different social lifestyles according to attendance at university and adulthood socio-economic position. This is supported by the associations between the education PRS and socio-economic status measures here, and also by twin studies linking education and health behaviours [41, 42]. The PRS for HDL is positively correlated with most of the alcohol consumption measures and negatively correlated with the smoking measures, while the opposite pattern is observed for the Triglycerides PRS. However, both the HDL and Triglycerides PRS have strong positive associations with the Oral Contraceptive Pill (OCP), which most likely reflects the impact of increased lipid levels among individuals on OCP in the lipids GWAS samples; that is, genetic factors affecting uptake of OCP may have been captured by the lipids GWAS due to the lipid-altering effect of OCP. Birth weight PRS was positively associated with both mother’s age and father’s age. This reflects findings in the literature linking lower maternal age with increased odds of low birth weights, although the association is U-shaped [43]. It has been suggested that social disadvantage underlies the low maternal age-low birth weight link [44]; nevertheless, our data suggest that whatever the underlying causal factor is, it is under a degree of genetic control.
S1 Fig and S2 Fig display the associations between the 48 PRS and health and anthropometric traits. The associations relating to traits such as height, weight, blood pressure, physical activity and diabetes are as expected based on the literature [45–47], but the heat maps also reveal potentially novel insights. For example, exercise, especially running, is positively correlated with the PRS for childhood IQ, years in education, bipolar disorder and positive affect, but negatively correlated with the ADHD PRS, which may highlight a potentially important, but mixed, role of physical activity in psychiatric disorders. Exercise has been shown to alleviate psychiatric symptoms [48,49], but since there are relatively few individuals in our data with psychiatric disorders, these associations may be more likely due to mediation by exercise rather than its therapeutic effects. Likewise, the PRS for breast cancer, Crohn’s disease and diabetes are associated with several physical activity measures as expected from epidemiological findings [50,51].
PRS*Sex interactions
Interactions between PRS and sex were also investigated (see Methods). The top results from the interaction analyses are presented in Fig 4. A significant effect modification by sex of the association between HDL PRS and sex hormone binding globulin levels (P = 8.13×10⁻⁶) was observed, while several other interactions were only marginally significant (P < 0.05). These results reflect the general finding in the literature that the autosomal genetic influence of complex traits is largely similar between males and females, with genotype by sex interactions having very small effect sizes compared to the main effects [52].
Discussion
Here we performed a large-scale systematic survey of genetic-phenotype associations using a set of 48 GWAS summary statistics and 143 phenotypes measured in the Northern Finland Birth Cohort 1966 (NFBC1966). Novel associations and replications were identified across a broad array of clinical, cardiometabolic, anthropometric, infectious disease, psychiatric and lifestyle traits. While this study is among a growing number of large cross-trait studies investigating shared genetic aetiology among human phenotypes, it represents the largest medically focused such study using the polygenic risk score (PRS) approach to date [6,7]. The use of the PRS method here highlights the potential for exploiting a single epidemiological study to gain insights into the underlying aetiology of a huge number of phenotypes, given the rich phenotyping typical of such studies. This is in contrast to the popular LD Score regression method [4], which requires large GWAS to have been performed on all traits under study and does not allow control for covariates.
The large number of GWAS exploited here and depth of phenotyping of the NFBC1966 meant that patterns of genetic-phenotype associations corresponding to related traits emerged, offering both internal support for associations as well as highlighting apparently conflicting results that deserve specific follow-up. For instance, associations between education PRS and a range of alcohol and smoking measures that potentially indicate mediation via socio-economic position, are supported by associations between the education PRS and socio-economic status variables. However, while the PRS for HDL and Triglycerides had associations in the opposite direction across almost all target traits, they had the same strong positive correlation with the oral contraceptive pill. Therefore, a profile of associations is observed among related traits, particularly useful for highlighting potential causal pathways and guiding follow-up investigations.
The central limitation of such a large-scale systematic study is that the testing performed on any specific phenotype is inevitably superficial in nature. While some of the genetic-phenotype associations observed suggest particular aetiological explanations, especially when considering groups of related associations, other than sex-interaction analyses we performed no further statistical testing to gain additional insights. However, we believe that this large-scale hypothesis-free approach to investigating shared genetic aetiology among human phenotypes has much value: within a single consistent analysis we have revealed evidence for shared genetic aetiology among hundreds of traits and provided what may be considered a ‘treasure trove’ of avenues for follow-up investigations. As similar studies are performed on different data sets and populations, patterns of replicating associations will emerge. The possibility of ‘collider bias’ [53] should be considered in relation to any of the observed associations but should be minimized by the use of a birth cohort, in which the vast majority of births in Northern Finland in 1966 were included and a high proportion of these genotyped. While spurious pleiotropy is also possible [2], the fact that these are genetic associations should be a greater indication of genuine causation than classical epidemiological associations. For instance, the associations observed here between alcohol consumption, smoking, education and socio-economic status suggest causal links between these factors, which has implications for epidemiology, in which these measures are often considered as only confounded by each other. However, such inference relating to these associations is only speculative until rigorous follow-up investigations are performed to uncover causal mechanisms.
This study has demonstrated that taking a hypothesis-free polygenic risk score approach to the investigation of shared genetic aetiology among phenotypes is an effective way of replicating previous, and uncovering novel, genetic-phenotype associations. The key advantage of requiring a relatively small target sample is the opportunity to exploit a greater depth of phenotyping, revealing a higher resolution profile of genetic overlap than possible otherwise. Our hope is that these genetic-phenotype associations provide a foundation and guide for investigations to reveal the pathways that lead to disease, both internally within the body, and externally through mediation via behavior and lifestyle.
Acknowledgements
We thank the late professor Paula Rantakallio (launch of NFBC1966), the participants in the 31yrs study and the NFBC project centre.
We thank the International Genomics of Alzheimer’s Project (IGAP) for providing summary results data for these analyses. The investigators within IGAP contributed to the design and implementation of IGAP and/or provided data but did not participate in analysis or writing of this report. IGAP was made possible by the generous participation of the control subjects, the patients, and their families. The i–Select chips were funded by the French National Foundation on Alzheimer’s disease and related disorders. EADI was supported by the LABEX (laboratory of excellence program investment for the future) DISTALZ grant, Inserm, Institut Pasteur de Lille, Université de Lille 2 and the Lille University Hospital. GERAD was supported by the Medical Research Council (Grant n° 503480), Alzheimer’s Research UK (Grant n° 503176), the Wellcome Trust (Grant n° 082604/2/07/Z) and German Federal Ministry of Education and Research (BMBF): Competence Network Dementia (CND) grant n° 01GI0102, 01GI0711, 01GI0420. CHARGE was partly supported by the NIH/NIA grant R01 AG033193 and the NIA AG081220 and AGES contract N01–AG–12100, the NHLBI grant R01 HL105756, the Icelandic Heart Association, and the Erasmus Medical Center and Erasmus University. ADGC was supported by the NIH/NIA grants: U01 AG032984, U24 AG021886, U01 AG016976, and the Alzheimer’s Association grant ADGC–10–196728. P.F.O. receives funding from the UK Medical Research Council (MR/N015746/1) and the Wellcome Trust (109863/Z/15/Z). This report represents independent research (part)-funded by the National Institute for Health Research (NIHR) Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King’s College London. The views expressed are those of the authors and not necessarily those of the NHS, the NIHR, or the Department of Health.
Data on coronary artery disease / myocardial infarction have been contributed by CARDIoGRAMplusC4D investigators and have been downloaded from http://www.CARDIOGRAMPLUSC4D.ORG.
Data on head circumference, childhood obesity, pubertal growth, Tanner stage and birth weight traits have been contributed by the EGG Consortium and have been downloaded from http://www.egg-consortium.org.
The Educational Attainment GWAS results were accessed under the Data Sharing Agreement of the Social Science Genetic Association Consortium. We thank the SSGAC for facilitating this research. C.A.R. acknowledges funding from the Netherlands Organisation for Scientific Research (NWO Veni grant 016.165.004).
References
- 1.
- 2.
- 3.
- 4.
- 5.
- 6.
- 7.
- 8.
- 9.
- 10.
- 11.
- 12.
- 13.
- 14.
- 15.
- 16.
- 17.
- 18.
- 19.
- 20.
- 21.
- 22.
- 23.
- 24.
- 25.
- 26.
- 27.
- 28.
- 29.
- 30.
- 31.
- 32.
- 33.
- 34.
- 35.
- 36.
- 37.
- 38.
- 39.
- 40.
- 41.
- 42.
- 43.
- 44.
- 45.
- 46.
- 47.
- 48.
- 49.
- 50.
- 51.
- 52.
- 53.
- 54.
- 55.
- 56.
- 57.
- 58.
- 59.
- 60.
- 61.
- 62.
- 63.
- 64.
- 65.
- 66.
- 67.
- 68.
- 69.
- 70.
- 71.
- 72.
- 73.
- 74.
- 75.
- 76.
- 77.
- 78.
- 79.
- 80.
- 81.
- 82.
- 83.
- 84.
- 85.
- 86.
- 87.
- 88.
- 89.