Abstract
Despite food choices being one of the most important factors influencing health, efforts to identify individual food groups and dietary patterns that cause disease have been challenging, with traditional nutritional epidemiological approaches plagued by biases and confounding. After identifying 302 (289 novel) individual genetic determinants of dietary intake in 445,779 individuals in the UK Biobank study, we develop a statistical genetics framework that enables us, for the first time, to directly assess the impact of food choices on health outcomes. We show that the biases which affect observational studies extend also to GWAS, genetic correlations and causal inference through genetics, which can be corrected by applying our methods. Finally, by applying Mendelian Randomization approaches to the corrected results we identify some of the first robust causal associations between eating patterns and risks of cancer, heart disease and obesity, distinguishing between the effects of specific foods or dietary patterns.
Introduction
Given their profound impact on human well-being, nutritional choices and their impact on health are one of the most studied human behaviours. Quality and quantity of food consumption are associated with a wide range of medical conditions including metabolic syndrome and cardiovascular disease1, cancer1, liver disease2, inflammatory bowel disease3 and depression4. Food choice is becoming increasingly significant for global health as energy-dense, low fibre western diets proliferate across the globe and an obesity epidemic follows4. Despite the extremely high number of studies reporting food/health associations it has been hard to establish causal relationships due to difficulty in measurement, recall bias and confounding.
Recently, causal inference has been improved by a large number of studies which use Mendelian Randomization (MR) to assess the causal relationship between one or more exposures and outcomes. In MR, genetic variants are used as instrumental variables to measure the “life-long exposure” to a risk factor5. This technique has proven to be extremely powerful: it is not influenced by the confounding typical of observational studies, and many of its results have been mirrored by randomised controlled trials5. It is thus appealing to use MR to assess the causal relationship between food and health. Unfortunately, the identification of genetic variants predicting dietary consumption has been limited to a few food groups, such as alcoholic beverages6, coffee7 and milk8,9, and existing evidence from dietary MR studies remains unremarkable10,11. More importantly, previous studies on a single food group have not accounted for interrelationships between different food groups. We therefore aimed to assess the causal relationship between food and several health outcomes by exploiting consumption patterns of multiple food groups in the UK Biobank (UKB) to create a new set of genetic instruments for MR analysis and then testing the causal effect of food consumption on health12.
GWAS of food traits
The first step in MR is to identify those genetic variants which are associated with the exposure of interest (food consumption in our case). We thus conducted a genome-wide association study (GWAS) on 29 food consumption traits, such as “beef” and “cheese” intake, using a mixed linear model in the white European participants of UKB13 (up to N=445,779), including only sex and age as covariates to avoid collider bias14. For a full description of the traits see Tables S1 and S2. The GWAS identified 414 phenotype-genotype associations divided into 260 independent loci with p < 1 × 10−8, summarized in Table S3 and Figure 1.
Replication for 23 of the 29 traits was sought in two additional UK-based cohorts (EPIC-Norfolk15 and Fenland16) totalling up to 32,779 subjects. Despite relatively limited power, we could nominally replicate 104/325 associations at p<0.05 (one-sided test) (32%; p=9.47×10−54). The direction of effect was consistent with that for discovery in 268 of the 325 associations (82%; p=7.82×10−35, Binomial test; see Table S5). After prioritization of the genes in each locus (see Methods for details and Supp. Table S4 for the prioritized genes), we noticed that for many genes associated with BMI, the BMI-raising allele was associated with lower reported consumption of energy-dense foods such as meat or fat and with higher consumption of lower-calorie foods. Although the exact mechanism of action of many of these genes is unknown, in the case of MC4R, loss-of-function K314X mutant mice show an increase in weight, higher intake of calories and higher preference for a high fat diet17, whereas we observe a lower intake of fat and higher intake of fresh fruit. We thus wondered if this could be due to the effect of higher BMI on food choices instead of the reverse and if this effect might also occur for a broader range of health-related traits.
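The replication counts reported above can be checked against a simple binomial null. The following is a minimal sketch, not the authors' code: it assumes that the 325 associations are independent and that a one-sided upper tail is the relevant test. The function name `binomial_upper_tail` is illustrative. Because the text does not give the exact test settings, the output need not match the reported p-values digit for digit.

```python
from fractions import Fraction
from math import comb


def binomial_upper_tail(successes: int, trials: int, p_null: Fraction) -> float:
    """One-sided P(X >= successes) for X ~ Binomial(trials, p_null), computed exactly."""
    q_null = 1 - p_null
    tail = sum(
        comb(trials, k) * p_null**k * q_null ** (trials - k)
        for k in range(successes, trials + 1)
    )
    return float(tail)


# Counts taken from the text: 268 of 325 associations share the discovery direction.
# Under the null, each direction is equally likely.
p_direction = binomial_upper_tail(268, 325, Fraction(1, 2))

# 104 of 325 associations reach one-sided p < 0.05; under the null each does so
# with probability 0.05 (independence across associations is an assumption here).
p_nominal = binomial_upper_tail(104, 325, Fraction(1, 20))

print(f"direction consistency: {p_direction:.3g}")
print(f"nominal replication:   {p_nominal:.3g}")
```

Exact rational arithmetic is used so that the very small tail probabilities are not lost to floating-point underflow before the final conversion.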
Detecting the effects of potential confounders on food frequency data
To test this hypothesis, we first selected nine diseases and risk factors for which dietary advice is usually given and for which GWA summary statistics (from large meta-analyses not including UKB) were available. Educational attainment was also included as a proxy for socioeconomic status. Using MR we identified 81 instances where we had evidence of health-related traits significantly influencing food choice (Fig. 2).
Aside from educational attainment, many associations seem to reflect common nutritional advice. For example, higher genetically-determined BMI associates with higher consumption of poultry, vegetables (both raw and cooked) and non-oily fish (as well as spirits and coffee), but with lower consumption of beef, processed meat, bread and fatty foods. Similarly, those genetically predisposed to CHD report lower consumption of whole milk, salt and lamb, and higher consumption of fish and red wine. This last case is particularly interesting, reflecting the standard dietary advice (lower intake of fat and salt but higher intake of fish as a means to increase omega-3 fatty acid intake18), but also higher consumption of red wine (and not other alcoholic beverages), which is commonly believed to have cardioprotective effects19,20.
From these MR results, it is clear that some of the loci we have identified in GWAS are not directly associated with food consumption but are the result of the effect of the health-related phenotypes on food consumption. Although we commonly consider the food-health relationship with diet as the exposure and disease as the outcome, we must consider that humans may change their behaviour because of their health status. This reverses the expected cause and effect relationship, making the interpretation of the GWAS results complex.
Correcting biases in dietary GWAS
To address the possibility of mediated effects, it is common to add the potential mediators as covariates in the association model. However, adding heritable covariates may lead to spurious associations due to collider bias (i.e. the false association between two variables induced by including a third variable (the collider) in the regression model, to which both variables of interest are causal)14. Moreover, when the causal relationship is bidirectional, adding a covariate will correct for the overall effect and not for the unidirectional effect we actually want to correct for.
We thus developed a new MR-based approach to correct the effect of each SNP in the dietary GWAS for the effect mediated through other confounding traits. Briefly, our approach consists of two steps. The first is to fit a multivariable MR model to estimate the effects of the traits we would like to test (the health-related traits in our case) on the traits of interest (the food traits). In the second step, an expected mediated effect is calculated for each SNP, based on the effect of the SNP on the mediator traits. The expected effect is then subtracted from the observed one to get an adjusted estimate (see Methods for details). This last step is exactly analogous to estimating the direct effect in mediation analysis21.
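The two steps described above can be sketched in a few lines. This is an illustration under simplifying assumptions, not the implementation described in the Methods. The multivariable model is fitted as a no-intercept weighted least-squares regression of SNP-food effects on SNP-mediator effects. Linkage disequilibrium, sample overlap and the uncertainty of the adjusted estimate are ignored. All function names and numbers are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_mvmr(beta_mediators, beta_food, se_food):
    """Step 1: weighted regression of food effects on mediator effects (no intercept)."""
    k = len(beta_mediators[0])
    w = [1.0 / se**2 for se in se_food]
    xtwx = [
        [sum(wi * row[a] * row[b] for wi, row in zip(w, beta_mediators)) for b in range(k)]
        for a in range(k)
    ]
    xtwy = [
        sum(wi * row[a] * y for wi, row, y in zip(w, beta_mediators, beta_food))
        for a in range(k)
    ]
    return solve_linear(xtwx, xtwy)


def adjust_effect(beta_food_snp, beta_mediators_snp, theta):
    """Step 2: subtract the expected mediated effect from the observed effect."""
    expected_mediated = sum(t * b for t, b in zip(theta, beta_mediators_snp))
    return beta_food_snp - expected_mediated


# Hypothetical instruments: effects on two mediators (e.g. BMI, education).
instrument_mediator_betas = [[0.10, 0.00], [0.00, 0.08], [0.05, 0.04], [0.07, -0.02]]
true_theta = [-0.3, 0.2]  # hypothetical causal effects of the mediators on a food trait
instrument_food_betas = [
    sum(t * b for t, b in zip(true_theta, row)) for row in instrument_mediator_betas
]
theta = fit_mvmr(instrument_mediator_betas, instrument_food_betas, [0.01] * 4)

# A hypothetical food-trait SNP that also affects the first mediator.
adjusted = adjust_effect(-0.020, [0.08, 0.00], theta)
print(theta, adjusted)
```

In this toy example the SNP's observed effect on the food trait has the opposite sign to its adjusted effect, which illustrates how strongly mediation can distort a raw GWAS estimate.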
We applied this method to all 29 food traits. As potential mediators, we used the traits tested in the univariate models, to which we added Crohn’s disease and ulcerative colitis, as they may impact dietary choices after diagnosis. We also removed total cholesterol to avoid problems due to collinearity with LDL and HDL cholesterol. Looking at the exposure traits selected for the multivariable (MV) causal model of each food trait (Supplementary Fig S3 panel A and Supplementary Table S8), educational attainment plays a fundamental role in shaping food choices, significantly influencing over half of the traits, as does BMI. Looking at the percentage of the genetic variance of the food traits explained by the health-related traits (Supplementary Fig S3 panel B and Supplementary Table S16), it ranges from 42% for cheese to ~0% for fortified wine and white wine/champagne, highlighting the scope these effects have to influence GWAS results. The combined results from all traits before and after adjustment for the effect of health status on food preference are shown in Fig. 1 (see Supplementary file 1 for trait-specific plots). In many loci previously associated with health-related traits, the effect changed dramatically, suggesting that the effect of the SNP on the food traits is mediated through health status. For example, the effect size of the lead FTO variant (rs55872725) with percentage fat in milk reduces by three-fold from 0.0045 to 0.0015 log units (p=2×10−29 and p=7×10−5, respectively). We observed similar effects for other associations at the same locus, which suggests that in general the associations we are observing near FTO are primarily mediated through its strong association with BMI22. 
This insight is crucial for interpretation: a naïve reading would conclude that eating less-healthy, more calorie-dense foods leads to a lower BMI, whereas our analysis suggests that having a higher BMI leads to either having a healthier diet or reporting one. This accords with known biases in dietary assessment23. Unfortunately, we cannot distinguish between a change in behaviour (and thus indication bias) and such reporting bias. These results warrant even greater caution in using SNPs influencing diet in MR or for functional follow up studies. Moreover, most nutritional epidemiological studies have focused only on BMI and socioeconomic status for correction, while we show that the confounding effects extend to many other health traits such as blood pressure and lipids. The widespread effect of education and BMI on dietary choices is especially strong on cheese and percentage fat in milk. This may explain some of the recent epidemiological results linking dairy product consumption to positive health benefits24.
To further explore the effects of the correction procedure, we compared the correlation patterns between the food traits and 832 phenotypes present in the LD hub25 database using the raw and corrected results (See Supplementary Data 2.3 and additional table S10). These analyses showed that the correction produced more meaningful food clusters and that in many cases the genetic correlations with other traits changed greatly (see https://npirastu.shinyapps.io/rg_plotter_2/ for a graphical representation of these results). For example, if we look at the relationship of the two fat intake traits (percentage fat in milk and adding spread to bread) and body fat percentage we can see that they both have a seemingly beneficial effect before correction (rG = −0.43 and −0.10, respectively) which diminishes to near zero (rG = −0.04 and 0.07) after applying the correction, suggesting that the apparent protective effect is likely due to confounding.
Clustering of food items
To investigate how the mediation procedure affected the genetic correlations amongst the consumption traits and with other traits, we first compared the clustering based on the uncorrected and adjusted genetic correlations. Figure S7 panel A shows the tanglegram comparing the two analyses. The adjusted correlations give more reasonable groupings, showing that some of the unadjusted clusterings are due in part to common confounders (e.g. wine clustering closer to coffee than to other alcoholic beverages) rather than to an actual common genetic background.
Clustering of the food traits based on their corrected genetic associations using ICLUST identified five different food groups (Fig S7 panel B): one composed of increased meat, fat, salt and decreased vegetarianism (labelled as “Meat/Fat”), one made up of alcoholic beverages and coffee (labelled “Psychoactive drinks”) and one comprised of healthier items such as fish, fruit and vegetables (labelled “Low-Calorie Foods”). Two final groups contained only two items each: drink temperature and tea; and cheese and bread; these were not used for the MV analysis. In order to explore if additional loci influence these groups, we ran a multivariate GWAS using the package MultiABEL, which performs MANOVA on summary statistics. 168 additional associations, including 42 novel loci not identified in the single-trait analysis, were identified in multivariate analysis of the three main food groups (Table S5).
Selection of instruments for MR
The primary objective of our study is to use MR to assess causal relationships between food choices and health. To achieve this goal we need to be able to identify the SNPs which have only a direct effect on the food trait, which is not mediated through other possible confounders. We hypothesised that if a SNP is biologically associated with a food behaviour – without mediation by health – its effect should not change strongly after the adjustment procedure. To try to distinguish the variants with only a direct effect from those with effects at least partly mediated through other traits, we defined the corrected-to-raw ratio (CRR) as the ratio between the corrected effect and the raw uncorrected one.
Through extensive simulations we estimated that a CRR range between 0.95 and 1.05 maximises the probability of selecting variants with only a direct effect, with 88% of the SNPs in this range being directly associated with the trait of interest (see Supplementary Data 2.1 for details on the simulations and Supplementary Data 1.8 for theory). Further evidence comes from variants in alcohol dehydrogenase 1B and the taste and olfactory receptors (for which clear biological pathways can be defined): all have CRR values between 0.95 and 1.05. We thus defined SNPs with a CRR in this range as “non-mediated”. 387 out of 581 associations, corresponding to 208/302 loci (~69%), were categorised as non-mediated associations, although of these 50 showed both mediated and non-mediated effects. The balance of mediated to non-mediated SNP associations varied by foodstuff, ranging from none mediated for tea, spirits and processed meat to all mediated for percentage fat in milk and adding spread to bread (see Table S3). The necessity of using the CRR filtering instead of existing methods is further outlined in additional paragraph 2.7.
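The CRR filter itself is a simple ratio test, sketched below. The function names are illustrative, and treating the interval bounds as inclusive is an assumption. The first example uses the raw and corrected effects quoted earlier for the lead FTO variant and percentage fat in milk (0.0045 and 0.0015). The second SNP is hypothetical.

```python
def corrected_to_raw_ratio(raw_beta: float, corrected_beta: float) -> float:
    """CRR: corrected effect divided by the raw, uncorrected effect."""
    if raw_beta == 0:
        raise ValueError("CRR is undefined when the raw effect is zero")
    return corrected_beta / raw_beta


def is_non_mediated(
    raw_beta: float, corrected_beta: float, lower: float = 0.95, upper: float = 1.05
) -> bool:
    """True when the CRR falls inside the range used to define non-mediated SNPs."""
    crr = corrected_to_raw_ratio(raw_beta, corrected_beta)
    return lower <= crr <= upper


# Lead FTO variant on percentage fat in milk: the effect shrinks three-fold.
fto_crr = corrected_to_raw_ratio(0.0045, 0.0015)
fto_non_mediated = is_non_mediated(0.0045, 0.0015)

# Hypothetical SNP whose effect barely changes after correction.
stable_non_mediated = is_non_mediated(0.0200, 0.0198)

print(round(fto_crr, 3), fto_non_mediated, stable_non_mediated)
```

A ratio rather than a difference is used so that the same threshold applies to SNPs with large and small effects alike.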
Functional annotation of the direct-effect-only loci and tissue enrichment analysis prominently feature brain areas involved in reward (Supplementary Data 2.5). Inference of interaction networks reveals ten communities ranging from feeding behaviour and energy metabolism to steroid response, acetylcholine receptor regulation and synaptic transmission (Supplementary Data 2.6 and Figure 4).
Causal inference
We proceeded to perform two-sample MR using the food traits as exposures and 78 traits (see table S17 for a list and description) as outcomes (chosen to include those for which diet could be a causal factor, that were in MR-base and for which full GWAS summary statistics were available). As well as using each single food trait as an exposure, we also assessed the effect of 16 different principal component (PC)-derived phenotypes based on the previous clustering of food traits, to quantify the consequences of broader dietary patterns. The relationships between the different traits are reported in figure S2, while loadings for each PC trait are reported in Fig 5 panel A. Traits which had no direct-effect-only SNPs (percentage fat in milk, fortified wine and adding spread to bread) were left out of the analysis. For each exposure-outcome pair, four types of analyses were performed, selecting instrumental variables with or without filtering by CRR and using corrected or uncorrected betas. We considered the CRR-filtered analysis using uncorrected betas as the main analysis and used the others for comparison. Finally, we considered as significant those exposure-outcome pairs that passed multiple-testing correction of the main analysis using Storey’s q-value at q<0.05. Table 1 reports the significant results, while all results can be found in table S18 and are available through a Shiny app (https://npirastu.shinyapps.io/Food_MR/).
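The four-way comparison described above can be laid out as follows. This is a sketch only. The text does not state which MR estimator was used, so a fixed-effect inverse-variance-weighted (IVW) estimate is assumed here as a common default. The SNP values and function names are hypothetical.

```python
from itertools import product
from math import sqrt


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance-weighted MR estimate and its standard error."""
    weights = [1.0 / se**2 for se in se_outcome]
    num = sum(w * bx * by for w, bx, by in zip(weights, beta_exposure, beta_outcome))
    den = sum(w * bx * bx for w, bx in zip(weights, beta_exposure))
    return num / den, sqrt(1.0 / den)


# Hypothetical instruments: (raw beta, corrected beta, outcome beta, outcome SE).
snps = [
    (0.020, 0.0198, 0.0100, 0.002),
    (0.030, 0.0303, 0.0150, 0.002),
    (0.045, 0.0150, 0.0020, 0.002),  # CRR of about 0.33, so removed by the filter
    (0.025, 0.0250, 0.0125, 0.002),
]


def run_analysis(snps, filter_by_crr: bool, use_corrected: bool):
    """One of the four analyses: optional CRR filter, raw or corrected exposure betas."""
    kept = [s for s in snps if not filter_by_crr or 0.95 <= s[1] / s[0] <= 1.05]
    beta_exposure = [s[1] if use_corrected else s[0] for s in kept]
    return ivw_estimate(beta_exposure, [s[2] for s in kept], [s[3] for s in kept])


results = {
    (filter_by_crr, use_corrected): run_analysis(snps, filter_by_crr, use_corrected)
    for filter_by_crr, use_corrected in product((True, False), repeat=2)
}

# Main analysis as defined in the text: CRR-filtered instruments, uncorrected betas.
main_estimate, main_se = results[(True, False)]
print(main_estimate, main_se)
```

In this toy data set, including the mediated SNP pulls the unfiltered estimate towards zero, which is the kind of distortion the CRR filter is meant to avoid.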
Looking at the significant MR results, we detected no sign of directional pleiotropy using the MR-Egger test (results in table S18). In some cases, we did detect strong heterogeneity of effect, especially with All PC1 and in general with PC-food exposures which included several diverse items. Considering more specific results, All PC1 differentiates those eating more meat and salt while drinking more alcohol and coffee from those who eat more fruit and vegetables; thus it describes a general healthy-unhealthy diet continuum. All PC1 showed the largest number of associations (15; Fig.S22a), with a healthy value of All PC1 lowering most risk factors linked to obesity and lipid profile (and likely consequently lowering cardiovascular disease risk) and having a positive effect on height and education. With the exception of educational attainment, these results may not be surprising as they broadly overlap with general dietary advice. However, when we decompose these effects into food groups or single foods, we detect differences amongst traits. For example, All PC1 leads to very similar effects across different obesity/adiposity measures: body fat % (β=−0.080, p=3.2×10−4), body mass index (β=−0.087, p=8.1×10−5), waist-to-hip ratio (β=−0.104, p=2.4×10−6) and BMI-adjusted waist-to-hip ratio (β=−0.078, p=2.9×10−4). Figure S23 shows the comparative effects of each food on the four obesity measures: generally, the individual foods affect all four in very similar ways, showing that the estimates are stable regardless of the outcome. However, there are some exceptions. For example, both Fresh Fruit and Oily Fish affect Body Fat and both waist-to-hip ratio measures but not BMI, suggesting that their effect is specifically on adiposity and not body size.
As a whole, alcohol does not seem to impact any of the four obesity traits, with a very small effect on waist-to-hip ratios. However, looking at each alcoholic beverage individually, beer has a substantial and specific effect on BMI not seen for the other alcoholic beverages, suggesting that this effect is independent of alcohol content.
Another notable result is the association of oily fish consumption with height (β= 0.2, p=1.76×10−8) (Fig S22c). It is unclear, however, if this is the result of general healthy eating or if it is the effect of a specific food. In particular, if we look at the effects of All PC1-3, we see a height-raising effect of PC1 (higher healthy foods, less alcohol/coffee and meat; β= 0.09, p=1.35×10−4), a height-lowering effect of PC2 (lower healthy food and meat and higher alcohol/coffee; β= −0.1, p=1.34×10−3), but no effect of PC3 (higher meat and less alcohol/coffee and healthy foods; β=−0.02, p=0.65), suggesting that the effect on height is led by healthy foods and alcohol/coffee but is independent of meat. Looking at the associations of Healthy PC1-3, we see association only with the first, which represents the overall consumption of fish, fruit and vegetables. Finally, comparing these three, we find that higher consumption of both vegetables and fish is associated with being taller, with similar effect sizes (Fish PC1, β=0.17, p=4.99×10−4 and Vegetables PC1, β=0.15, p=1.30×10−3), while fruit has no effect (β= 0, p= 0.96), which makes the effects of fish and vegetables indistinguishable.
Several associations seem to be masked by the confounding effects. For example, if we look at genetically-determined beef intake, the CRR-corrected instruments show a significant association with being taller (β= 0.51 SD adjusted vs. β= −0.01 unadjusted) and with other anthropometric traits such as hip and waist circumference. None of these associations were recovered using the raw instruments, for which the estimated effects were extremely close to 0. This shows that the problems arising from using the unadjusted set of instruments are not limited to false positive results but can also generate false negatives, depending on the biases involved.
Discussion
Our results emphasise how complicated relationships among dietary traits are. We have clearly shown that the causal path between food and health is not unidirectional and that in fact genes may affect food behaviours in many different and unexpected ways. Understanding the origins of these effects is fundamental not only for prioritizing loci for functional follow up, but also for understanding why genetic correlations and GWAS results change when different datasets or populations are used. In fact, given that many of the effects we see are likely due to confounding, if the health advice in different populations changes this could alter the architecture of the studied trait and thus the GWAS results, which would appear as allelic heterogeneity.
It is unclear whether these effects are limited to dietary phenotypes or if they extend to other traits and further studies are needed to resolve this issue. Recent similar studies10,11 on the genetic bases of dietary patterns reported having detected no reverse causality. We believe that this difference is due to our novel approach, which is not based on using the potential confounders as covariates, but rather exploits MR, which should be able to distinguish the forward and reverse effects when the causal relationship is bidirectional. Nevertheless, extreme care is required when claiming causal relationships between food and health as the level and complexity of the biases and confounding is so high that it affects even MR, which is known to be more robust than other approaches to these types of effects.
In a classic dietary analysis, investigators evaluate macronutrient compositions. In this study, we did not see similar effects from foods which have similar macronutrient composition. For example, if we look at cheese and meat, which are both relatively high in saturated fat and protein, we see no association of eating either with blood lipid profile (triglycerides, LDL or total cholesterol), while they have opposite effects on BMI (cheese lowering it and meat increasing it) (Fig S22e). While the findings require further investigation of mechanisms and related behaviours, our genetic evidence lends support to the importance of food consumption and dietary patterns, and not only intakes of specific nutrients26.
If we look at which foods have the greatest effect on triglycerides, it is fruit, vegetables and fish, all with lowering effects (Fig S22f), and not sources of carbohydrates or alcohol, the known drivers of de novo lipogenesis. This seems to be confirmed by looking at the results with the overall PC traits (All-PC1, -PC2, -PC3), in which a higher consumption of fruit, vegetables and fish is always associated with lower triglycerides regardless of the loading on other food groups. It is impossible, however, to separate the effects of fruit, vegetables and fish from each other. In fact, if we look at the Healthy PC traits (see fig 5 panel A), only PC1, which summarises a higher consumption of all three, is associated with lower triglycerides, suggesting the combined effects of all three dietary factors or of unmeasured correlated dietary behaviours or healthful habits.
This example shows that when considering the effect of food on health it is sometimes hard to separate the effect of single foods (although we have shown some examples) from those which are usually consumed together in a pattern. In this case, although fish, fruit and vegetables have very different macronutrient compositions, it is impossible to separate their effects on triglycerides. This has been implied in previous studies, including the European study on the lactase persistence gene9. There, while the MR relating the lactase-persistence gene to diabetes incidence provided no evidence for a causal effect of milk consumption, the secondary analyses found that the lactase-persistence variant was related to consumption of potatoes, poultry, and cereals. These pieces of genetic evidence highlight the importance of a dietary pattern rather than single foods or nutrients. Any health claim from observational studies regarding one or the other should always take these facts into account. For further details of specific results, our online app allows exploration of hypotheses.
Our study was limited by the number of items available in the dietary questionnaire in the UK Biobank and thus has not explored the full extent of human nutrition. Unfortunately, apart from bread consumption, no carbohydrate or sugar sources were measured, limiting our ability to explore these macronutrients and thus capture the overall diet. Nonetheless, this limitation is unlikely to overturn the abovementioned cautionary interpretation of the dietary MR results. Another important limitation is that effect sizes could be inflated because of the underestimation of the SNP effects on the food traits, which will increase the MR effect estimates. This underestimation is due to the noise in the questionnaire responses, which warrants further statistical investigation. Of note, as we have no rationale to consider non-random measurement error, it is unlikely to hinder the detection of a causal effect or its direction, but further studies are needed to assess the precise effect sizes. Before translation of our findings into policy, more studies using different methodologies will be required.
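The inflation argument above can be made concrete with a single-SNP Wald ratio. This is a worked illustration only: the attenuation factor and all effect sizes are hypothetical, and random noise is assumed to shrink the SNP-exposure estimate by a constant factor while leaving the SNP-outcome estimate unchanged.

```python
def wald_ratio(beta_outcome: float, beta_exposure: float) -> float:
    """Single-SNP MR estimate: SNP-outcome effect divided by SNP-exposure effect."""
    return beta_outcome / beta_exposure


true_beta_exposure = 0.020  # hypothetical SNP effect on true food intake
attenuation = 0.75          # hypothetical shrinkage caused by questionnaire noise
observed_beta_exposure = attenuation * true_beta_exposure
beta_outcome = 0.004        # hypothetical SNP effect on the health outcome

true_effect = wald_ratio(beta_outcome, true_beta_exposure)
observed_effect = wald_ratio(beta_outcome, observed_beta_exposure)
inflation = observed_effect / true_effect

print(true_effect, observed_effect, inflation)
```

Under these assumptions the estimate is inflated by the reciprocal of the attenuation factor, while its sign is unchanged. This is consistent with the point that detection and direction of a causal effect are unaffected even when its magnitude is overstated.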
In conclusion, we have developed an important framework and new tools to help illuminate the effects of nutrition on health and have shown that, despite the existing belief that certain dietary assessment methods provide low-quality data, it is still possible to extract useful information using our methods. It will be interesting to learn to what degree the confounding of food choice reporting by educational attainment and disease risk factors observed here is seen in other settings whose food cultures and social stratification differ from those of the UK.
Author Contributions
NP, JFW, JRBP, ZK, EJG, FRD, KKO contributed to the study design. JFW, TE, JRBP, AR, TG, FI, KKO, FRD contributed data. NP, CMD, EJG, NM, FI, JZ, NT, KAK, MPC performed the statistical analyses. NP, JFW, ZK, JRBP, TE, NT, KF, CMD, LR, EJG, FI, KKO, FRD contributed to the interpretation of the results. All authors contributed to writing and editing of the text.
Data Availability
All GWAS results will be made available through GWAS catalog at the time of publication. All results from the MR analyses have been shared in the additional tables.
Acknowledgements
J.F.W. acknowledges support from the MRC Human Genetics Unit quinquennial programme grant “QTL in Health and Disease”. This research has been conducted using the UK Biobank Resource under Application Number 19655. We would like to thank Erin MacDonald-Dunlop, and Pascale Lubbe for help with statistical analyses.
EGCUT was funded by Estonian Research Council Grant IUT20-60, PUT1660 (T.E.), PUT1665 (K.F.), the European Union through the European Regional Development Fund grant no. 2014-2020.4.01.15-0012 GENTRANSMED and 2014-2020.4.2.2, and Estonian and European Research Roadmap grant no.2014-2020.4.01.16-0125. The EPIC-Norfolk study (DOI 10.22025/2019.10.105.00004) has received funding from the Medical Research Council (MR/N003284/1, MC-PC_13048, and MC-UU_12015/1). The Fenland study (DOI: 10.1186/ISRCTN72077169) was funded by the Medical Research Council and the Wellcome Trust (Ref: 074548). J.P., K.O., F.I., and F.R.D. were funded by the UK Medical Research Council Epidemiology Unit core grant (MC-UU_12015/2, MC_UU_12015/5). T.R.G. receives funding from the UK Medical Research Council (MC_UU_00011/4). Z.K. received funding from the Swiss National Science Foundation (31003A_169929). We are grateful to all the participants who have been part of the project and to the many members of the study teams at the University of Cambridge who have enabled this research.