Abstract
Understanding the causal pathogenic mechanisms of diseases is crucial in clinical research. When randomized controlled experiments are not available, Mendelian Randomization (MR) offers an alternative, leveraging genetic mutations as a natural “experiment” to mitigate environmental confounding. However, most MR analyses treat the risk factors as static variables, potentially oversimplifying dynamic risk factor effects. The framework of life-course MR has been introduced to address this issue. Current methods, however, face challenges, especially when the age-specific GWAS datasets have limited cohort sizes and there are substantial correlations between time points for a single trait. This study proposes a novel approach that estimates a unified system of structural equations for a sequence of temporally ordered heritable traits and requires only GWAS summary statistics. The method facilitates statistical inference on direct, indirect, and path-wise causal effects and demonstrates superior efficiency and reliability, particularly with noisy GWAS data. By incorporating a spike-and-slab prior for genetic effects, the approach can address extreme polygenicity and weak instrument bias. Through this methodology, we uncover a protective effect of BMI on breast cancer during a confined period of childhood development. We also analyze how BMI, systolic blood pressure (SBP), and low-density lipoprotein cholesterol levels influence stroke risk across childhood and adulthood, and identify relationships among these risk factors.
1 Introduction
Understanding the pathogenic mechanisms of diseases is a foundational challenge in clinical research. Given the limitations of conducting randomized controlled experiments in certain cases, there has been a growing reliance on Mendelian Randomization (MR) as an alternative approach. MR leverages genetic mutations and inheritance as a natural “experiment,” effectively mitigating unmeasured environmental confounding in epidemiological studies [1, 2]. However, most MR analyses use a cross-sectional design and treat the risk factor as a static variable, ignoring the fact that inheritable risk factors often change over time and may have a time-varying effect on diseases [3, 4]. As shown in earlier studies, this can lead to oversimplified and even misleading conclusions from MR studies [5]. For instance, observational studies have shown that vitamin D levels in childhood, but not adulthood, are associated with the risk of multiple sclerosis. However, standard MR approaches based on the vitamin D level in adult individuals suggest its causal effect in the etiology of multiple sclerosis [6].
To address this challenge, life-course MR has recently emerged as a framework to consider how risk factors that are measured throughout an individual’s lifetime may influence later life outcomes [7]. One commonly employed approach is to apply multivariable MR (MVMR) methods such as IVW-MVMR [8] to estimate the direct causal effects of the risk factor at each time point under linear and additive causal effect models. However, the efficiency of multivariable MR may be compromised due to the limited cohort size in Genome-wide Association Studies (GWAS) of earlier life traits and high auto-correlations of the risk factor across time. It is also challenging to evaluate the indirect causal effect of the risk factor at earlier time points mediated by later time points using multivariable MR. Alternative approaches involve g-estimations of structural mean models [4, 9], or functional principal component analysis to aggregate the effects of a risk factor across time points [10]. However, both methods require access to individual-level GWAS data that is typically not readily available on a large scale. Additionally, the method from [10] is unable to distinguish between the direct and indirect effects across different time points.
Using genetic variations as instrumental variables throughout the life course, we propose a new approach to estimate a unified system of structural equations for a sequence of heritable traits that are ordered temporally, only requiring GWAS summary statistics for each trait. Our estimated model gives a “full picture” of the time-varying risk factors and enables causal mediation analysis by allowing researchers to quantify various aspects of causal effects, including direct effects, indirect effects, and the proportion of each path-wise effect relative to the total effect. While our model shares similarities with simultaneous equations models in econometrics (Davidson, 1993), we can deal with weak and invalid instruments and allow arbitrary interactions among genetic variants and environmental factors. Furthermore, our model can simultaneously analyze multiple traits at any given time point, accounting for known confounding exposures.
In comparison to alternative approaches, our method shows superior efficiency and reliability, particularly when dealing with noisy GWAS data from small cohorts for age-specific traits. By incorporating a spike-and-slab prior for genetic effects, we account for the extreme polygenicity of complex traits and avoid weak instrument bias, a common issue in MR. Additionally, we discuss sufficient conditions for the identifiability of any direct and indirect causal effects, accommodating pervasive horizontal pleiotropy and arbitrary mediator-outcome confounding. Applying our new method, we uncover a protective effect of Body Mass Index (BMI) on breast cancer which is confined to a specific period during childhood development. The method also allows us to unravel the intricate relationships among BMI, systolic blood pressure (SBP), and low-density lipoprotein cholesterol levels across both adulthood and childhood, and their effects on stroke. Our analyses suggest that, among these traits, adulthood SBP may be the only factor with a direct causal effect on stroke. Furthermore, the previously identified causal effect of BMI on SBP in adulthood [11, 12] might be explained by the confounding effect of childhood SBP.
2 Material and methods
2.1 Structural equations on individual-level data
Consider a sequence of risk factors (X1, X2, …, XK−1) in temporal order, where Xi temporally precedes Xj when i < j. We are interested in the causal relationships among these traits and their causal effects on the outcome Y, which we also denote as XK and which is typically a specific disease status in adulthood. Denote Z as the vector of all genetic variants. Figure 1 illustrates the causal directed acyclic graph (DAG) showing the causal relationships among the genetic variants, risk factors, outcome, and unmeasured environmental confounders U. Causal effects between the traits must follow the temporal order (only an earlier trait can causally affect a later trait), but we allow the genetic variants to have direct causal effects on any trait at any time. Based on this, we assume the following additive structural equations for each individual:
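A form consistent with the surrounding description (linear, homogeneous effects βkl of each earlier trait on each later trait, plus a trait-specific function fk) is sketched below. The arguments (Z, U) passed to fk are an assumption made for illustration; the text only states that fk captures the direct genetic and environmental effects.

```latex
\begin{align*}
X_1 &= f_1(Z, U), \\
X_2 &= \beta_{21} X_1 + f_2(Z, U), \\
    &\;\;\vdots \\
X_K &= \sum_{l=1}^{K-1} \beta_{Kl} X_l + f_K(Z, U).
\end{align*}
```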
In these equations, the functions f1(·), …, fK(·) are the direct effects of genetics and environmental factors on traits, which can be non-linear and involve arbitrary interactions. Our main assumption is linearity and homogeneity of the causal effect βkl for any earlier trait Xl (1 ≤ l ≤ K − 1) on any later trait Xk (l +1 ≤ k ≤ K). As discussed in the next section, this assumption of linearity and homogeneity becomes crucial, especially when our data only includes GWAS summary statistics, a condition also shared by previous summary-data MR methods.
Figure 1. Model overview. The causal directed graph associated with structural equations (1). X1, …, XK−1 are exposure traits in increasing temporal order and Y is the outcome trait. The blue arrows represent the causal effects of the genetic variants Z on the traits. The red arrows represent the effects of the unmeasured non-heritable confounders U.
2.2 Model on summary statistics
Two-sample MR is a popular implementation of MR that uses two independent GWAS samples, one for the exposure and one for the outcome [13]. Similar to many existing methods for two-sample MR, we only require GWAS summary statistics for each trait, so one may call this design “K-sample MR”. The fact that summary statistics are sufficient is due to the linearity and homogeneity of the causal effect in (1). This implies a linear model with measurement error on the GWAS summary statistics, as we will explain next.
Specifically, let γkj ≡ argminγVar[Xk −γZj] denote the coefficient corresponding to the least squares projection of the trait Xk on the genetic variant Zj. The GWAS summary statistics provide estimates of these marginal associations along with their standard errors. In particular, we assume we observe $\hat{\gamma}_{kj} \sim N(\gamma_{kj}, \sigma_{kj}^2)$ for any SNP Zj and trait Xk, where the variances $\sigma_{kj}^2$ are known. Further, let $\alpha_{kj} \equiv \operatorname{argmin}_{\alpha} \operatorname{Var}[f_k(\cdot) - \alpha Z_j]$ denote the projection of the genetic and environmental direct effects of trait Xk onto SNP Zj. If a SNP Zj is used as an instrumental variable for trait Xk, then for any k′≠k the value $\alpha_{k'j}$ can be viewed as the pleiotropic effect of SNP Zj on another trait $X_{k'}$. Notice that, unlike the marginal associations γkj, these direct genetic effects αkj are generally not identifiable from GWAS data. We will make a “balanced pleiotropy” assumption in later sections that is akin to the Instrument Strength Independent of Direct Effect (InSIDE) assumption in the literature [12, 14].
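By projecting the structural equations (1) onto the SNP Zj, we obtain the following linear equations (Supplementary Text):
$$\gamma_{kj} = \alpha_{kj} + \sum_{l=1}^{k-1} \beta_{kl}\, \gamma_{lj}, \qquad k = 1, \ldots, K.$$
These equations can be more conveniently expressed in matrix form as:
$$\Gamma = (I - B)^{-1} A, \qquad (2)$$
where P is the number of SNPs and the matrices are defined as $\Gamma = (\gamma_{kj}) \in \mathbb{R}^{K \times P}$, $A = (\alpha_{kj}) \in \mathbb{R}^{K \times P}$, and $B = (\beta_{kl}) \in \mathbb{R}^{K \times K}$ with $\beta_{kl} = 0$ whenever $l \ge k$.
Equation (2) can be alternatively represented as
$$\Gamma = (I + \tilde{B}) A, \qquad (3)$$
where I is the K×K identity matrix, and $\tilde{B}$ is the lower-triangular matrix that solves $(I + \tilde{B})(I - B) = I$. Compared to (2), the parametrization in (3) avoids matrix inversion, making it more amenable for statistical estimation. Because B is strictly lower-triangular and hence nilpotent, it is easy to show that $\tilde{B}$ can be written as a finite Neumann series:
$$\tilde{B} = \sum_{m=1}^{K-1} B^m.$$
Intuitively, the (k, l) entry of the matrix $\tilde{B}$ denotes the total causal effect of Xl on Xk through all directed pathways [15].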
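The relationship between direct and total effects can be checked numerically. The sketch below, written in plain Python, builds the matrix of total effects as the finite sum of powers of a strictly lower-triangular direct-effect matrix. The function names and the numerical values for K = 3 are hypothetical and chosen only for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def identity(n):
    """Return the n x n identity matrix as nested lists."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def total_effects(direct):
    """Return B + B^2 + ... + B^(K-1) for a strictly lower-triangular B."""
    k = len(direct)
    total = [[0.0] * k for _ in range(k)]
    power = identity(k)
    for _ in range(k - 1):
        power = matmul(power, direct)
        for i in range(k):
            for j in range(k):
                total[i][j] += power[i][j]
    return total


# Hypothetical direct effects for K = 3 (rows: affected trait, columns: cause).
direct = [[0.0, 0.0, 0.0],
          [0.5, 0.0, 0.0],   # X1 -> X2
          [0.2, 0.4, 0.0]]   # X1 -> Y and X2 -> Y
total = total_effects(direct)
# Entry (3, 1) adds the direct path 0.2 and the mediated path 0.4 * 0.5.
```

In this toy example the total effect of X1 on Y combines the direct effect with the indirect effect through X2, which is the decomposition the model is designed to recover.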
We assume that we can use separate GWAS datasets to obtain p-values for SNP selection, thereby mitigating potential selection biases [16, 17]. The selection p-value for each SNP is defined as the Bonferroni-corrected p-value, computed from K − 1 GWAS summary datasets corresponding to traits X1 to XK−1. SNPs are subsequently chosen based on the selection p-values using LD clumping [18], ensuring that selected SNPs are approximately independent. Compared to a stringent p-value threshold (such as $10^{-8}$) for SNP selection, our method allows for a higher threshold (such as $10^{-4}$ or 0.01) and uses SNPs that are weakly associated with the exposure traits. This can generally increase the power of the MR analysis.
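The selection step can be sketched as follows. This is one reading of the Bonferroni-corrected selection p-value, assumed here to be the smallest per-trait p-value multiplied by the number of exposure traits and capped at 1. The SNP identifiers and p-values are hypothetical, and LD clumping is not shown.

```python
def selection_p_value(p_values):
    """Bonferroni-corrected minimum p-value across the exposure GWAS for one SNP."""
    return min(1.0, len(p_values) * min(p_values))


def select_snps(p_by_snp, threshold):
    """Return the SNP ids whose selection p-value is below the threshold."""
    return [snp for snp, ps in p_by_snp.items()
            if selection_p_value(ps) < threshold]


# Hypothetical p-values from K - 1 = 2 selection GWAS datasets.
p_by_snp = {
    "rs_a": [3e-6, 0.2],
    "rs_b": [0.04, 0.009],
    "rs_c": [0.6, 0.3],
}
strict = select_snps(p_by_snp, 1e-4)    # keeps only strongly associated SNPs
lenient = select_snps(p_by_snp, 0.05)   # also keeps weakly associated SNPs
```

A more lenient threshold admits more SNPs, which is the mechanism by which weak instruments enter the analysis.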
Upon selecting the SNPs, another set of GWAS datasets is used to obtain the summary statistics for the exposure and outcome traits. For each SNP Zj, the summary statistics follow a normal distribution $\hat{\Gamma}_j \sim N(\Gamma_j, \Sigma_j)$, where Γj := (γ1j, γ2j, …, γKj)T is a vector of marginal associations and Σj is a covariance matrix obtained from the GWAS standard errors and a correlation matrix that depends on the extent of sample overlap and the correlation between the traits. This correlation matrix is shared across the SNPs and can be estimated from the non-statistically significant GWAS summary statistics using the method described in [17] (Supplementary Text).
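As an illustration of how a per-SNP covariance can be assembled from standard errors and a shared correlation matrix, the sketch below assumes the common form in which entry (a, b) equals se_a × R_ab × se_b. This form, the function name, and all numbers are assumptions for illustration and are not taken from the Supplementary Text.

```python
def snp_covariance(std_errors, corr):
    """Build a K x K covariance with entries se_a * R_ab * se_b for one SNP."""
    k = len(std_errors)
    return [[std_errors[a] * corr[a][b] * std_errors[b] for b in range(k)]
            for a in range(k)]


# Hypothetical shared noise correlation: traits 1 and 2 come from overlapping samples.
R = [[1.0, 0.3, 0.0],
     [0.3, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
se_j = [0.02, 0.01, 0.05]   # hypothetical GWAS standard errors for SNP j
Sigma_j = snp_covariance(se_j, R)
```

When the GWAS samples do not overlap, R reduces to the identity and the covariance is diagonal with the squared standard errors on the diagonal.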
2.3 Identifiability of direct and indirect causal effects
As the MR design uses genetic variants as instrumental variables, a crucial assumption is that these instrumental variables are valid. In particular, MR relies on the assumption that the genetic variants have no horizontal pleiotropic effects [19]. Selecting suitable SNPs for this purpose is particularly challenging due to the complex polygenic nature of complex traits [20, 21]. Many recent MR studies have proposed methods to conduct MR in the presence of horizontal pleiotropy, with various additional assumptions on the pleiotropic effects [12, 14, 22, 23].
In the context of life-course MR, we confront an additional challenge that arises from unmeasured mediator-outcome confounding, which can incur biases in the estimation of direct and indirect causal effects even when the treatment is completely randomized [24]. This is illustrated in Figure 2 with two exposures and an outcome (K = 3). If we only use genetic instrumental variables for the first risk factor X1 (Figure 2a), even if all SNPs are valid IVs, it is not possible to separate the direct and indirect effects of X1 due to unmeasured mediator-outcome confounding between X2 and Y (see Supplementary Text for a counter-example). In contrast, if each risk factor has its own set of instrumental variables (Figure 2b), it is possible to identify all direct and indirect causal effects under the linear structural equation model in (1). However, given that the sequence of risk factors is often the same risk factor measured at different time points, it may be difficult to find SNPs that only exert their effects at a specific time point.
Figure 2. Model identifiability under three scenarios.
We propose an alternative and more realistic assumption, which generalizes the InSIDE assumption [14] and is illustrated in Figure 2c. We allow the SNPs to exert effects on all risk factors continuously, and they may all have pleiotropic effects on the outcome and are thus not strictly valid instrumental variables. Nonetheless, we show that the causal coefficient matrix can still be identified provided two key assumptions are satisfied. First, we require that the direct effects α1j, …, αKj be independent across the traits. Second, the number of SNPs P needs to go to infinity; in practice, this means that P needs to be sufficiently large. Notice that we only assume independence among the αkj, and the marginal associations γkj can still be correlated across the traits indexed by k. Additionally, we do not need to assume that the genetic associations across SNPs within each trait are identically distributed, and we allow for heterogeneity across SNPs. These new assumptions allow us to address both pleiotropic effects and mediator-outcome confounding. A formal mathematical statement of this new identifiability result and its proof can be found in the Supplementary Text.
2.4 Model estimation and inference
We use a hierarchical Bayesian framework and Gibbs sampler to infer the direct and indirect causal effects, which can all be expressed in terms of the matrix B. The Gibbs sampler allows us to conveniently use posterior samples to construct credible intervals of any function of B, including any direct and indirect effects, and proportions of these effects out of the total effects.
Based on earlier investigations [17], both the genetic associations and pleiotropic effects can be highly polygenic, indicating that most elements in A are likely nonzero. Although most SNPs only have weak effects on complex traits, a small set of SNPs may be responsible for the core biological process and have a strong effect on the traits. Thus, as the most critical component of our Bayesian hierarchical model, we place a spike-and-slab prior on each αkj.
This model assumes a two-component Gaussian mixture on the direct genetic effects, wherein all SNPs are permitted to have non-zero genetic effects on each trait and a subset of SNPs is allowed to exhibit larger effects. In addition, we put Gaussian priors on elements of the matrix B and conjugate priors on the hyperparameters.
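One way to write a two-component Gaussian mixture of this kind is sketched below. The symbols $p_k$ (the share of SNPs with larger effects on trait k) and $\sigma_{k0}^2 < \sigma_{k1}^2$ (the small and large variance components) are assumptions introduced for illustration and may differ from the notation and exact prior specification in the Supplementary Text.

```latex
\alpha_{kj} \;\sim\; p_k \, N\!\left(0, \sigma_{k1}^2\right)
          + (1 - p_k) \, N\!\left(0, \sigma_{k0}^2\right),
\qquad 0 < \sigma_{k0}^2 < \sigma_{k1}^2 .
```

Under this form every SNP has a nonzero direct effect with probability one, while the component with variance $\sigma_{k1}^2$ accommodates the smaller set of SNPs with stronger effects.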
To estimate the above Bayesian hierarchical model, we employ a Gibbs sampler algorithm which can efficiently generate posterior samples when K is small. Posterior samples of the total-effect matrix $\tilde{B}$ in parametrization (3) directly give us posterior samples of the direct-effect matrix $B = I - (I + \tilde{B})^{-1}$, which can then be used to construct credible intervals for any direct and indirect causal effects, or their functions. Additionally, we use a simplified empirical Bayes approach to choose the hyper-parameters
and (ak, bk), recognizing their significant impact on the posterior distributions of B. For further computational and mathematical details, see the Supplementary Text.
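The step from posterior samples to credible intervals for functions of the causal effects can be sketched as follows for K = 3. The posterior draws here are simulated from hypothetical Gaussians purely to make the example runnable; in practice they would come from the Gibbs sampler. The names b21, b31, and b32 are illustrative labels for the direct effects X1 → X2, X1 → Y, and X2 → Y.

```python
import random


def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already sorted list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def credible_interval(draws, level=0.95):
    """Equal-tailed credible interval from a list of posterior draws."""
    s = sorted(draws)
    tail = (1.0 - level) / 2.0
    return quantile(s, tail), quantile(s, 1.0 - tail)


# Hypothetical posterior draws of the direct effects (stand-ins for sampler output).
rng = random.Random(0)
draws = [{"b21": rng.gauss(0.5, 0.05),
          "b31": rng.gauss(0.2, 0.05),
          "b32": rng.gauss(0.4, 0.05)} for _ in range(4000)]

# Any function of the effects is evaluated draw by draw.
indirect = [d["b32"] * d["b21"] for d in draws]              # X1 -> X2 -> Y
total = [d["b31"] + d["b32"] * d["b21"] for d in draws]      # all paths from X1 to Y
proportion = [i / t for i, t in zip(indirect, total)]        # mediated share

ci_indirect = credible_interval(indirect)
ci_proportion = credible_interval(proportion)
```

Because each quantity is computed per draw, the resulting intervals account for the posterior dependence between the effects, which is what a multi-step procedure with separate estimates cannot provide directly.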
2.5 Extension to multiple traits at each time point
To account for known confounders, we further expand our model to accommodate multiple traits at any given time point (Figure 3). Specifically, at each time point k, we assume there are nk ≥ 1 exposures of interest. To facilitate the MR analysis, we require two key assumptions. First, we assume that there is no causal effect between the traits at the same time point. This assumption avoids the need to identify the causal direction between any two traits that are at the same time point. If there is a known causal direction between two traits measured at time point k, we can add a “pseudo-time point” (or stage) k + 1 after k and move the trait that is affected by the other to stage k + 1. Our second assumption is a generalization of the independence assumption on direct genetic effects described in Section 2.3. In the more general setting considered here, we require that the direct genetic effects αkj of any SNP j be independent across all traits, including those at the same and different time points. With these assumptions in place, we can adapt our model and Gibbs sampler to infer the direct and indirect causal effects. For more details, see Supplementary Text.
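The pseudo-time-point construction can be sketched as a small staging routine. The function below is a hypothetical helper, not part of the MrMediation package: it pushes a trait to a later stage whenever a trait measured at the same time point is known to affect it. The trait labels are illustrative and mirror the kind of design used when BMI may affect SBP at the same age.

```python
def assign_stages(time_point, same_time_edges):
    """Map each trait to a stage so that no known causal edge stays within a stage.

    time_point: dict mapping trait name to its (integer) measurement time.
    same_time_edges: list of (cause, effect) pairs measured at the same time.
    """
    depth = {trait: 0 for trait in time_point}
    # Repeated relaxation; len(time_point) passes suffice for an acyclic edge list.
    for _ in range(len(time_point)):
        for cause, effect in same_time_edges:
            depth[effect] = max(depth[effect], depth[cause] + 1)
    keys = sorted({(time_point[t], depth[t]) for t in time_point})
    rank = {key: i + 1 for i, key in enumerate(keys)}
    return {t: rank[(time_point[t], depth[t])] for t in time_point}


# Hypothetical labels: (C) childhood, (A) adulthood.
time_point = {"BMI(C)": 1, "LDL-C(C)": 1, "SBP(C)": 1,
              "BMI(A)": 2, "LDL-C(A)": 2, "SBP(A)": 2, "stroke": 3}
edges = [("BMI(C)", "SBP(C)"), ("BMI(A)", "SBP(A)")]
stages = assign_stages(time_point, edges)
```

With these inputs the two SBP traits each receive their own stage directly after the stage of the traits measured at the same age, giving five stages in total.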
Figure 3. Illustration of causal relationships across traits allowing multiple traits at each time point.
2.6 Data sources
GWAS summary statistics are downloaded from public sources. For adult BMI, data is obtained from the GIANT consortium website (https://portals.broadinstitute.org/collaboration/giant/index.php/GIANT_consortium_data_files) and the UK Biobank Neale’s lab (http://www.nealelab.is/uk-biobank/) using phenotype code 21001. Adult lipid traits (LDL-C, HDL-C, and triglycerides) summary statistics are acquired from the GLGC cohort via the GLGC website (https://csg.sph.umich.edu/willer/public/lipids2013/) and from the GERA cohort through GWAS Catalog with study accession numbers GCST007141, GCST007140, and GCST007142. Breast cancer GWAS summary statistics come from GWAS Catalog with study accession number GCST004988, while those for T2D are downloaded from the DIAGRAM website (https://diagram-consortium.org/downloads.html), and for stroke from GWAS Catalog with study accession number GCST006906.
For childhood traits, GWAS summary statistics on 1-year-old BMI and 8-year-old BMI are contributed by the Centre For Diabetes Research, University of Bergen, Norway, and the Norwegian Mother, Father and Child study, downloadable from their website (https://www.fhi.no/en/studies/moba/for-forskere-artikler/gwas-data-from-moba/). Childhood BMI GWAS data from the EGG Consortium can be found at http://egg-consortium.org/childhood-bmi.html. The ALSPAC datasets for childhood lipid traits, SBP, and BMI are obtained from GWAS Catalog with study accession numbers GCST90104679, GCST90104678, GCST90104680, GCST90104683, GCST90104677.
3 Results
3.1 Simulation studies: benchmarking with multivariable MR
We compare our Bayesian framework with the alternative approach using MVMR to perform causal mediation analysis [6, 25]. Specifically, we benchmark our method with IVW-MVMR [8], which is widely used in practice, and GRAPPLE [17], a frequentist approach designed to account for independent pleiotropic effects and weak instruments. At each step k, we apply the chosen multivariable MR method to estimate the direct causal effects of X1, …, Xk−1 on Xk. A major drawback of this multi-step approach to life-course MR is that it is difficult to provide reliable statistical inference on indirect and path-wise effects, as the estimates across the different steps are correlated.
We generate synthetic GWAS summary datasets by soft-thresholding real GWAS summary statistics, allowing the “true” genetic effects to be exactly 0 for some SNPs. Specifically, if we observe $\hat{\gamma}_{kj}$ with variance $\sigma_{kj}^2$ for a particular SNP j in a real GWAS dataset, we set $\alpha_{kj}$ to a soft-thresholded version of $\hat{\gamma}_{kj}$ and also randomly shuffle αkj across j within each trait k to ensure independence across traits. Once the matrix A is generated, we further generate Γ following our linear model (2) with a pre-specified matrix B. To specify B, we simulate three scenarios with K = 3, K = 4, and a multivariate case where there are multiple exposures at each time point (Figure 4). Finally, we simulate our synthetic summary statistics $\hat{\gamma}_{kj} \sim N(\gamma_{kj}, \sigma_{kj}^2)$ independently across all SNPs and traits.
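The simulation recipe can be sketched in plain Python as below. The threshold multiplier `lam`, the toy input values, and the function names are hypothetical; the actual threshold used in the study is not stated in this section. For simplicity the sketch shuffles only the thresholded effects and leaves the standard errors attached to their original SNP positions.

```python
import random


def soft_threshold(x, t):
    """Shrink x toward zero by t, returning exactly 0 when |x| <= t."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def simulate(gamma_hat, se, direct, lam, rng):
    """gamma_hat, se: K x P nested lists; direct: K x K strictly lower-triangular."""
    K, P = len(gamma_hat), len(gamma_hat[0])
    A = []
    for k in range(K):
        row = [soft_threshold(gamma_hat[k][j], lam * se[k][j]) for j in range(P)]
        rng.shuffle(row)  # shuffle within a trait to break dependence across traits
        A.append(row)
    # Forward recursion: gamma_kj = alpha_kj + sum over l < k of beta_kl * gamma_lj.
    Gamma = []
    for k in range(K):
        Gamma.append([A[k][j] + sum(direct[k][l] * Gamma[l][j] for l in range(k))
                      for j in range(P)])
    # Add independent Gaussian noise with the GWAS standard errors.
    noisy = [[rng.gauss(Gamma[k][j], se[k][j]) for j in range(P)] for k in range(K)]
    return A, Gamma, noisy


rng = random.Random(1)
gamma_hat = [[0.05, -0.001, 0.03], [0.002, 0.04, -0.06]]  # toy "real" estimates
se = [[0.01] * 3, [0.01] * 3]
direct = [[0.0, 0.0], [0.5, 0.0]]                          # pre-specified effect X1 -> X2
A, Gamma, noisy = simulate(gamma_hat, se, direct, lam=1.0, rng=rng)
```

Soft-thresholding is what produces exact zeros among the simulated direct effects, and the within-trait shuffle enforces the independence across traits that the identifiability result relies on.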
Figure 4. Simulation scenarios.
To mimic real MR mediation analysis, where earlier childhood traits typically have a smaller sample size than adulthood traits, we generate synthetic data for earlier traits based on childhood GWAS data. Specifically, we use childhood BMI data from the Norwegian Mother, Father, and Child Cohort Study (MoBa) [26] and childhood lipid traits from the Avon Longitudinal Study of Parents and Children (ALSPAC) cohort [27]. Adult GWAS datasets are used to generate summary statistics for the remaining traits. See the Supplementary Text for additional details of the simulation setup.
Figure 5 shows the coverage and average lengths of the confidence and credible intervals of the direct and indirect causal effects when K = 3 obtained using different methods. SNPs are selected based on cutoffs of p-values that reflect the strength of the true association between SNP j and trait k. Genetic variants that are weakly associated with the traits are selected under a large p-value threshold. As shown in the figure, for the direct causal effects, although the IVW-MVMR method has the shortest confidence intervals, their coverage falls far below the nominal level, possibly due to the presence of pleiotropic effects and weak instruments. In contrast, both our Bayesian approach and GRAPPLE do not suffer from weak instrument bias and have good coverage irrespective of the strength of the SNPs. Compared to GRAPPLE, our Bayesian approach yields shorter intervals and therefore more powerful inference. Additionally, regarding the indirect causal effects of X1, only our Bayesian approach provides reliable inference, and our credible intervals demonstrate good coverage and reasonable power. We also observe similar advantages of our Bayesian approach in the other two simulation scenarios: K = 4 and the multivariate case (Figure S1-S2).
Figure 5. Simulation results for K = 3. The first row illustrates empirical coverage of 95% confidence intervals on direct effects of X1 and X2 on Y and indirect effects of X1 over 500 repeated simulations. The second row displays the boxplots of lengths of confidence intervals over repeated simulations.
3.2 Effect of early life body size on breast cancer
Several recent studies have indicated that early-life body size serves as a protective factor against the risk of breast cancer [28, 29, 30]. Using IVW-MVMR for the mediation analysis, [3] observed that the protective effect of early-life body size on breast cancer is not mediated by adult body size, while adult body size itself does not causally influence breast cancer. However, in their analysis, though the adult body size was quantified by the adult BMI trait, the representation of early-life body size relied on the early-life body score from UK Biobank, which is a questionnaire recall trait asking adults to recall whether they were thinner or plumper than average at the age of 10. As discussed by the original authors, the use of such an imprecise measure for childhood body size may raise concerns about the credibility of the scientific conclusions when compared to using direct measurements of childhood BMI.
To address this, we revisit the analysis, comparing the earlier results with a new approach employing our Bayesian method and directly utilizing childhood BMI GWAS summary statistics to represent childhood body size. Similar to [3], for the adult body size we use the adult BMI trait from UK Biobank and use GWAS summary statistics on breast cancer from [31] for the outcome trait. To avoid instrument selection biases, we select SNPs based on their significance in separate GIANT Adult BMI [32] and EGG childhood BMI [33] GWAS datasets. As shown in Figure 6a, when early-life body score from the UK Biobank is used as the exposure, our Bayesian method, along with IVW-MVMR and GRAPPLE, replicates the findings from [3], indicating no causal effect of adult BMI and a direct protective effect of early-life body size on breast cancer.
Figure 6. Evaluation of the effect of body size on breast cancer at different ages. a) Estimated effects of childhood body size (from UK Biobank) and adult BMI on breast cancer risk as estimated by MV-IVW, GRAPPLE, and our Bayesian approach. b) Estimated effects of childhood BMI and adult BMI on breast cancer risk from different methods. c) Estimated causal DAG from our Bayesian approach with selection p-value threshold at $10^{-2}$. The black arrows indicate significant direct effects.
However, replacing the early-life body size trait with the GWAS trait of 8-year-old BMI from MoBa yields surprising results. IVW-MVMR suggests a protective causal effect of adult BMI on breast cancer, contradicting earlier conclusions, while childhood BMI shows no direct effect (Figure 6b). In contrast, GRAPPLE loses its power to detect any causal effects. Only our Bayesian approach reaches a conclusion similar to that of the earlier analysis, affirming the direct protective effect of childhood BMI. Although the 8-year-old BMI trait offers a more precise measurement of childhood BMI, its use in MR is challenging due to its small sample size (3K samples compared to half a million in the UK Biobank). To assess the change in instrument strength when replacing the early-life body size trait from UK Biobank with the 8-year-old BMI GWAS trait from MoBa, we compute the conditional F-statistics [34] of the exposure traits. We observe a substantial decrease in the conditional F-statistics in the childhood dataset (Figure S3), indicating a loss of instrument strength due to the limited cohort size of the MoBa study. Similar to our simulations, the analysis demonstrates that IVW-MVMR can produce unreliable confidence intervals, while GRAPPLE lacks power when the GWAS dataset sample size for any of the exposure traits is small. Our new Bayesian method demonstrates both efficiency and robustness, especially in mediation analysis, where GWAS studies for early-life traits typically have small sample sizes.
To further enhance our understanding of the impact of early-life body size on breast cancer, we include infant BMI, specifically the 1-year-old BMI GWAS trait from MoBa, in our mediation analysis. As the 1-year-old BMI and 8-year-old BMI traits share the same cohort, we also incorporate the estimated noise correlation matrix in our Bayesian approach (Figure S4). As shown in Figure 6c, we observe a causal effect of 1-year-old BMI on 8-year-old BMI, but no direct causal effects of 1-year-old BMI on adult BMI or breast cancer. This suggests that the protective effect of body size on breast cancer is confined to a specific period during childhood development.
3.3 High blood pressure, BMI and stroke
High blood pressure is widely acknowledged as the primary modifiable risk factor for stroke [35]. Through univariate MR analyses, a recent study has also revealed additional potential causal risks for stroke outcomes including the adult BMI [36]. Moreover, using univariate MR, recent studies have suggested that BMI has a significant positive causal effect on high blood pressure in adulthood [11, 12]. We aim to unravel the intertwined impacts of high blood pressure and BMI on stroke, delving into the dynamic evolution of these causal relationships throughout the life course. To ensure robust findings, we will also account for potential confounding risk factors, such as lipid traits.
Specifically, we analyze the causal effects of three risk factors: systolic blood pressure (SBP), BMI and low-density lipoprotein cholesterol (LDL-C) in both childhood and adulthood on stroke, using our multivariate mediation analysis framework (Figure 3). As our multivariate framework does not allow any causal relationships among the risk factors at the same time point, to accommodate the potential causal effects of BMI on SBP, we create two additional “time points” (stages) for the SBP traits (Figure 7a). For the childhood BMI, LDL-C and SBP traits we use GWAS summary statistics from the ALSPAC cohort [27]. We use GWAS summary statistics from UK Biobank for adult BMI and SBP, and the summary statistics from Global Lipids Genetics Consortium [37] for adult LDL-C. For stroke, the summary statistics are from [38]. SNP selections are based on the p-values in the GERA GWAS for SBP and LDL-C [39, 40], and the p-values in GIANT adult BMI dataset and EGG childhood BMI dataset.
Figure 7. Multivariate mediation analysis on LDL-C, BMI, SBP and stroke. a) Design of the stages. Childhood traits are marked with label (C) and adulthood traits are marked with label (A). b) Estimated causal effects with different selection p-value thresholds. Arrows correspond to the significant effects, with thicker arrows indicating larger effects. Positive effects are red and negative effects are blue.
Figure 7b illustrates the estimated causal directed graph at various selection p-value thresholds using our Bayesian method. More significant arrows emerge with milder p-value thresholds, suggesting increased power with the inclusion of weak instruments. As anticipated, all childhood exposures exhibit significant causal impacts on their corresponding adulthood exposures. Our findings indicate a positive causal direct effect of adulthood SBP on stroke, while no compelling evidence supports direct effects from other traits on stroke, including childhood traits and adult BMI.
A surprising result from our analysis is a lack of support for a causal effect of adult BMI on adult SBP, which is contrary to earlier univariable MR findings [11, 12]. To examine this more closely, we also perform multivariable MR, with childhood SBP and adult BMI as exposures and adult SBP as the outcome. Both IVW-MVMR and GRAPPLE indicate that adult BMI no longer exerts a positive causal effect on adult SBP after accounting for childhood SBP (Figure S5a). Additionally, we observe a stronger genetic correlation between childhood SBP and adult BMI compared to that between adult SBP and BMI (Figure S5b). Collectively, these results suggest that the confounding effect of childhood SBP may have contributed to the association of adult BMI with SBP identified in previous univariable MR analyses.
4 Discussion
Based on a unified model across all traits, we propose a Bayesian approach with GWAS summary data for life-course Mendelian Randomization. The proposed method allows the assessment of time-varying causal effects of heritable risk factors, distinguishing between direct and indirect causal effects of traits in temporal order. It addresses a key challenge in life-course MR, namely the high genetic correlation of a trait across different ages combined with the limited cohort size of age-specific GWAS. Our method is robust to the bias that arises from using weakly associated SNPs as instruments and can efficiently integrate information across traits.
A key assumption in our causal structural equation system is linearity and homogeneity of the risk factors’ causal effects. Concerns may arise about the use of linear structural equations when the outcome trait or the exposure trait is binary. In the former case, MR methods based on linear models can still estimate meaningful but attenuated causal effects [12]. In the latter case, caveats of MR are discussed in [41]. Interpreting the estimated linear structural equations may be challenging in practice, particularly when there is a sequence of binary exposures across time.
For the identification of causal effects in life-course MR, a similar argument was discussed in [7]. They show that in order for the causal direct effects to be identified, genetic variants must exert different effects on each exposure in the model, and these effects must be linearly independent, or equivalently, the true SNP-trait association matrix has full rank. This is necessary but not sufficient for identifying all direct and indirect effects. For instance, in univariable MR, pleiotropic effects must be restricted (such as by the InSIDE assumption) for causal identification. Merely assuming that the pleiotropic effects are not perfectly correlated with the genetic associations with the exposure does not suffice to ensure the identification of the causal effect. An additional assumption, such as our independence assumption of direct associations across traits, is needed to ensure the separation of direct and indirect causal effects. While our independence assumption permits pervasive pleiotropy, it assumes no correlated pleiotropy for any traits. In practice, if confounding pathways exist between traits leading to correlated pleiotropy, we recommend collecting additional GWAS data for confounding traits and explicitly adjusting them using the multivariate extension of our approach.
A limitation of life-course MR is missing time points, where exposures at additional time points play a role in causal relationships but are not considered in the MR analysis [6, 42]. If the exposure trait at the missing time point Xk is a confounder of any two later-time traits Xl and Xm in our model, it may invalidate the assumptions in our model. Specifically, if a selected SNP Zj is associated with Xk, its direct associations with Xl and Xm become correlated through the paths Zj → Xk → Xl and Zj → Xk → Xm, thereby violating our independence assumption. However, our independence assumption remains true if Xk is not a hidden confounder for any pair of subsequent traits. Examples include when Xk solely influences the outcome or the exposure at the next time point.
Another concern about life-course MR is the possibility of reverse causality, where the outcome trait may exert causal effects on the exposure at the latest time point [6]. Typically, life-course MR benefits from clear causal directions owing to the temporal order of traits. However, in many applications a notable challenge arises when the last exposure and the outcome are collected at the same time, such as in adulthood. In cases where the selected SNPs primarily associate with the outcome through their connections with the exposures, MR generally remains resilient to bias caused by reverse causation [6, 17, 43]. Our framework shares this advantageous property of MR.
5 Declaration of Interests
The authors have declared that no competing interests exist.
6 Acknowledgments
J.W. is partly supported by the National Science Foundation under grants DMS-2113646 and DMS-2238656. Q.Z. is partly supported by EPSRC grant EP/V049968/1.
7 Code availability
The R package MrMediation for conducting our Bayesian mediation MR analysis is publicly available for installation at https://github.com/ZixuanWu1/MrMediation.
Supplemental materials
S1 Additional mathematical details
S1.1 Derivation of summary-data linear models from individual structural equations
Recall that we define
Projecting Xk onto Zj, we have
where corr(Zj, ϵkj) = 0. On the other hand, we have
where corr
. Substituting (1) into (2) for l = 1, …, k − 1, we further have
Since corr
, both (1) and (3) are linear projections of Xk on Zj, thus
S1.2 Noise correlation estimation
In Section 2.3 we claim that when traits have overlapping cohorts, the estimates are not independent across k, but, approximately, all SNPs share the same correlation matrix, which can be effectively estimated from the GWAS summary statistics themselves. As shown in [1], for any risk factors k and l we have
where Nk and Nl are the sample sizes for the risk factors Xk and Xl, Nkl is the number of overlapping samples, and Xks denotes the measurement of Xk for individual s. The correlation of Xk and Xl in any shared sample is Corr[Xks, Xls]. As a consequence, we assume
where R is the unknown shared correlation matrix.
To estimate R, we choose SNPs where γkj = 0 for all k so that we can estimate the shared correlations using the sample correlation of the chosen SNPs. To do this we select SNPs with p-values pj ≥ 0.5 in all selection files.
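The selection-then-correlation step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the MrMediation implementation: the function name, the argument layout, and the use of z-scores (estimated effect divided by standard error) as the quantity whose sample correlation is taken are all assumptions made for the example.

```python
import math

def estimate_noise_correlation(z_scores, selection_pvalues, threshold=0.5):
    """Estimate the shared noise correlation matrix R from presumed-null SNPs.

    z_scores[k][j]          : z-score of SNP j for trait k (assumed input layout)
    selection_pvalues[s][j] : p-value of SNP j in selection file s
    """
    n_snps = len(z_scores[0])
    # Keep SNPs whose p-value is >= threshold in every selection file.
    null_snps = [j for j in range(n_snps)
                 if all(p[j] >= threshold for p in selection_pvalues)]
    if len(null_snps) < 2:
        raise ValueError("need at least two null SNPs to estimate correlations")

    null_z = [[row[j] for j in null_snps] for row in z_scores]
    means = [sum(r) / len(r) for r in null_z]
    centered = [[v - m for v in r] for r, m in zip(null_z, means)]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    n_traits = len(centered)
    # Sample correlation between every pair of traits over the null SNPs.
    return [[dot(centered[k], centered[l])
             / math.sqrt(dot(centered[k], centered[k]) * dot(centered[l], centered[l]))
             for l in range(n_traits)]
            for k in range(n_traits)]
```

For null SNPs the z-scores are approximately pure noise, so their sample correlation across SNPs approximates the shared matrix R; in practice many thousands of null SNPs would be used.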
S1.3 Identifiability
In this section, we provide a formal mathematical statement for identifying the causal effect matrix in the linear model:
where
In the main text, it was highlighted that the causal effect matrix
is not identifiable when only genetic instrumental variables for the first risk factor X1 are involved in the multivariable MR analysis. Here we give a simple example to illustrate how the direct and indirect effects can be inseparable in such cases. Suppose
where Z = (Z1, Z2, …, ZP) are independent SNPs. Suppose we have an infinite sample size, so that (α1, α2, …, αP), β1(α1, α2, …, αP) and (β2 + β1β3)(α1, α2, …, αP) are directly observed. In this case β1 can be identified simply as the ratio of the first two marginal effects. On the other hand, for any
, one can define
, so that
. Hence in this case the direct effects and indirect effects of X1 on Y are not identifiable.
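A small numerical check can make this non-identifiability concrete. The sketch below assumes the structure implied by the observed quantities in the example (SNPs act only on X1, X1 affects X2 with coefficient β1, and Y depends on X1 and X2 with coefficients β2 and β3); the function name and the numbers are illustrative only.

```python
def marginal_associations(alpha, beta1, beta2, beta3):
    """Marginal SNP associations with X1, X2 and Y when SNPs act only on X1."""
    gamma_x1 = list(alpha)
    gamma_x2 = [beta1 * a for a in alpha]
    # Total effect of X1 on Y: direct (beta2) plus indirect via X2 (beta1 * beta3).
    gamma_y = [(beta2 + beta1 * beta3) * a for a in alpha]
    return gamma_x1, gamma_x2, gamma_y


alpha = [0.5, -1.0, 2.0]
beta1, beta2, beta3 = 0.5, 1.0, 2.0

# Pick any other value for beta3 and compensate through beta2.
beta3_alt = 0.0
beta2_alt = beta2 + beta1 * (beta3 - beta3_alt)

original = marginal_associations(alpha, beta1, beta2, beta3)
alternative = marginal_associations(alpha, beta1, beta2_alt, beta3_alt)
same_data = original == alternative  # both parameter sets imply identical summary data
```

Both parameter settings yield the same three vectors of marginal associations, so no amount of data of this kind can separate the direct effect of X1 on Y from the effect mediated by X2.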
Now suppose for each single exposure trait we have some instrumental strength. As discussed in the main text, we do not assume that the SNPs are valid IVs for any single exposure trait but allow a large number of SNPs P to be used. Unlike the identifiability problems in standard linear models, we do not assume identical distributions across SNPs to allow for arbitrary heterogeneity. We can identify in the following sense:
Under model (6), we further assume
Infinite GWAS data: the sample sizes nk of the GWAS data for each trait k are large enough that the marginal associations γkj can be uniformly consistently estimated across j and k by the summary statistics when P → ∞ and mink nk → ∞.
Independence: αkj’s are mutually independent across k and j
Bounded moment:
for some constant M.
Well-behaved limiting average moments: For any k, l ∈ [K], we have
exists and
exists
Non-zero instrumental effect: For any k ∈ [K − 1], we have
Then the elements of the causal effect matrix in (6) can be consistently estimated when P → ∞ and mink nk → ∞.
Proof. Let
. Then alternatively we can write
We only need to show that elements in B can be consistently estimated as by definition,
. Under the infinite GWAS data assumption, without loss of generality, we treat all γkj as directly observed.
First, we prove the following lemma.
Let (Xj, yj) be independent random variables. Suppose
where Xj ╨ ϵj. Assume the following quantities exist:
In addition, assume that
for some constant M
where
is full rank.
Then
is identifiable from the data when n → ∞.
Proof of Lemma 2. First note we have
Additionally, since
we obtain that
where the inequality in the third line is due to
. Thus we have
Now we prove Theorem 1 by induction. We first define the following notations. For any matrix M, define
(similarly for M≤k,j and Mk,≤j). We shall show by induction that for k = 1, 2, …, K − 1, we have
Bk,<k can be consistently estimated
exist and can be consistently estimated when P → ∞.
Note that the existence of µk and υk is guaranteed by our assumptions, so we only need to show the existence of ck in our induction.
Base case. When k = 1, we need to show µ1, υ1, c1 exist and can be consistently estimated. Here the existence of c1 is implied by the existence of υ1 and µ2 once we observe that
Note here we have α1j = γ1j are directly observed. Therefore by law of large numbers, under the bounded moment assumption, we have
Induction Step. Suppose 1, 2 hold for all k ≤ k0 − 1. Then we use Lemma 2 to show
can be consistently estimated. Denote
To apply Lemma 2, we first check assumptions. Note
By the induction hypothesis and our assumptions of existence of limits, all of these quantities exist. Moreover, the bounded second moment assumption in Lemma 2 is satisfied by our bounded fourth moment assumption. The condition
is directly assumed in the statement of the theorem. The full-rank assumption of
is implied by the non-zero instrumental effect because
To see this, note for any k, we have
Therefore
Hence
for any k. One can use similar arguments to show the off-diagonals of
are zeros.
Combining all the results we have, by Lemma 2,
We can consistently estimate
as long as
, µy and c can be consistently estimated. Given their definition and the induction hypotheses, we can consistently estimate
and c. In addition, by the law of large numbers, under the bounded moments condition, we have
This shows that
can be consistently estimated.
It remains to show exist and can be consistently estimated. Here
exists as it is a weighted sum of moments of the α’s:
Since
can be consistently estimated, we know
can also be consistently estimated. Let
be our consistent estimator of
. We know
And for any k ≤ k0
This finishes the induction. We have now shown that for any k ∈ [K − 1], Bk,<k, µk, υk, ck can be consistently estimated. Following the same argument as in the induction step, by Lemma 2, we know BK,<K can be consistently estimated. This finishes the proof.
S1.4 Details of the Gibbs Sampler
We observe GWAS summary statistics of the marginal associations between SNPs and traits, with standard errors δkj. Denote Δj = diag(δ1j, …, δKj). Denote by R the correlation matrix of the trait-association estimates, which is shared by all SNPs.
The model is
We have K traits and P SNPs. The prior parameters are
and bk. Also let
S1.4.1 Notation
For a K × P matrix M, let
Mi,<k denotes a row vector of all elements of M in the ith row and in columns less than k
Mi,* denotes a whole row, and M*,j a column.
vec(M) denotes the vectorization of M, i.e.,
For K = P, define LTvec(M) to be the lower-triangular vectorization of M (excluding the diagonal):
which is a
-dimensional vector. Let υ be a K dimensional vector:
Define
which has dimensions
. Here 0k denotes the zero vector of dimension k. Hence, if M is square, lower-triangular with zero diagonal,
Also, for random variables U and V, p(U|V) denotes the conditional density of U given V.
S1.4.2 Updating B
With prior:
likelihood (noise distribution):
so that
We have that
Where
From Theorem 2.2 in [2], we know the posterior is
S1.4.3 Updating σ2, σ0, σ1
First observe that conditioning on B, we have
(as σ is independent of A, Z, p, σ0, σ1 and depends on
only through B). It follows that
So it suffices to consider p(σ|B).
We have prior:
and likelihood:
so we have posterior:
where n = K(K − 1)/2 is the number of degrees of freedom in B.
Similarly for σ0 and σ1, we have
Thus
Also for any k ∈ [K], since σk0 is independent of
, we have
Similarly
Note
with prior,
where q ∈ {0, 1}.
Then the posterior is,
Where
.
S1.4.4 Updating pk
Note
Therefore
Since pk is independent of
, for k′ ≠ k, we have
With prior
and likelihood
we have Posterior:
Where
.
S1.4.5 Updating (A, Z)
To sample (A, Z) we first sample marginal of Z given everything but A. Observe that
Divide both sides by
. By independence among columns of
, we have
Therefore we can just inspect each Z*,j on its own. We have a likelihood given by:
Where
and we have prior
In cases where K is small enough, we can easily compute the un-normalized probability weights for all possibilities of Zj, and sample from a discrete distribution.
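The enumeration strategy described above can be sketched as follows. The callable `log_weight` is a hypothetical stand-in for the logarithm of the unnormalized posterior weight of one indicator configuration (prior times marginal likelihood with A integrated out); its actual form comes from the likelihood and prior stated above and is not reproduced here.

```python
import itertools
import math
import random

def sample_indicator_column(log_weight, n_traits, rng):
    """Draw one column Z_{*,j} by enumerating all 2^K binary configurations.

    log_weight : callable mapping a tuple in {0,1}^K to an unnormalized
                 log posterior weight (hypothetical placeholder)
    n_traits   : K, assumed small enough that 2^K configurations are cheap
    rng        : a random.Random instance
    """
    configs = list(itertools.product((0, 1), repeat=n_traits))
    log_w = [log_weight(z) for z in configs]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_w)
    weights = [math.exp(lw - m) for lw in log_w]
    # random.choices normalizes the weights internally.
    return rng.choices(configs, weights=weights, k=1)[0]
```

The cost grows as 2^K per SNP, which is why this exact enumeration is only attractive when the number of traits is small.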
Now that we have a sample from the marginal of Z we can sample from the conditional of A, which has prior of
Recall that we have the linear Gaussian model,
with noise distribution,
so we get a posterior similar to before of
S1.5 Estimation for hyper-parameters
In our empirical studies, we find that the posterior distributions of B can be sensitive to the choice of the hyper-parameters and (ak, bk). Here we take an empirical approach to choosing the hyper-parameters. Specifically, these hyper-parameters are set to the corresponding maximum likelihood estimators in the marginalized model, which can be obtained by the expectation-maximization (EM) algorithm. Recall that our model assumes for k ∈ {1, 2, …, K},
In particular,
We shall first estimate the hyper-parameters for the outcome pleiotropies. In the E step, we compute the posterior distribution of the latent variables given our current estimates of parameters. In the M step we solve the maximization problem of the expectation of the log likelihood over the latent variables.
To get rid of the identifiability issue, we initialize and
to be far from each other. When solving for
and
in the M-step, we impose the restriction that they can be at most 10 times larger than their values from the previous iteration. Once we have obtained the point estimates of (p1, σ10, σ11), we choose the hyper-parameters for their priors such that the prior means are equal to the point estimates.
For k ∈ {2, 3, …, K}, however, this is theoretically unachievable because the direct associations αkj are no longer directly observed. In practice we use the marginal associations in place of the direct associations to determine the hyper-parameters. The resulting priors will be biased, especially when the genetic effects on a trait are largely mediated by an upstream trait that is already in the model. Nonetheless, our numerical simulations suggest that this bias may be small in most reasonable scenarios. One possible alternative is to treat this as a first-order approximation, use the Gibbs sampler to estimate B, use that estimate to recover the direct genetic associations, and then repeat the expectation-maximization. In favorable settings, if the estimates of B in the first stage are close to the true values, this two-stage procedure will improve the precision of the algorithm. Indeed, this second-stage estimation is usually quite helpful in univariate Mendelian Randomization. On the other hand, the step of estimating the direct genetic associations aggregates the noise from multiple exposures and makes it harder for the EM algorithm to find meaningful and reliable estimates of the hyper-parameters. Therefore the two-stage procedure is only recommended when the number of traits is small and the noise levels are moderate.
S1.5.1 EM details
Recall that our model assumes for k ∈ {1, 2, …, K},
In particular,
For simplicity, we will drop the subscript 1 for the time being.
E-step. Denote θ = (p, σ0, σ1). With latent variable zj ∼ Bernoulli(p), where
comes from the first distribution when zj = 1, the membership probabilities are
where ϕ(·; µ, σ) is the density of the normal distribution with mean µ and standard deviation σ. Then we can compute the expectation of the data log-likelihood with respect to the distribution of Z given the data and θ(t):
where C is a constant that does not depend on θ.
M-step. Taking derivative with respect to p and setting it to zero yields
For
, it is equivalent to minimize
and similarly for
. These objectives can be optimized with a numerical optimization method.
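One EM iteration of the kind described above might look like the following sketch. It assumes, as an illustration rather than as the paper's exact marginalized model, that each observed association is drawn from a two-component zero-mean normal mixture with variances σ1² + δj² (when zj = 1) and σ0² + δj² (when zj = 0), where δj is the known standard error. The golden-section search and all function names are choices made for this example; the upper bound of ten times the previous value mirrors the restriction mentioned in S1.5.

```python
import math

def normal_pdf(x, sd):
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def log_likelihood(x, se, p, s0, s1):
    """Marginal log-likelihood of the assumed two-component mixture."""
    total = 0.0
    for xj, dj in zip(x, se):
        slab = p * normal_pdf(xj, math.hypot(s1, dj))
        spike = (1.0 - p) * normal_pdf(xj, math.hypot(s0, dj))
        total += math.log(slab + spike)
    return total

def weighted_objective(s, x, se, w):
    """Weighted negative Gaussian log-likelihood in s, up to constants."""
    return sum(wj * (math.log(s * s + dj * dj) + xj * xj / (s * s + dj * dj))
               for xj, dj, wj in zip(x, se, w))

def golden_section_min(f, lo, hi, iters=60):
    """Derivative-free 1-D minimization; assumes f is unimodal on [lo, hi]."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - ratio * (b - a), a + ratio * (b - a)
    fc, fd = f(c), f(d)
    for _ in range(iters):
        if fc < fd:
            b, d, fd = d, c, fc
            c = b - ratio * (b - a)
            fc = f(c)
        else:
            a, c, fc = c, d, fd
            d = a + ratio * (b - a)
            fd = f(d)
    return 0.5 * (a + b)

def em_step(x, se, p, s0, s1):
    """One EM update of (p, s0, s1)."""
    # E-step: membership probabilities P(z_j = 1 | data, current parameters).
    w = []
    for xj, dj in zip(x, se):
        slab = p * normal_pdf(xj, math.hypot(s1, dj))
        spike = (1.0 - p) * normal_pdf(xj, math.hypot(s0, dj))
        w.append(slab / (slab + spike))
    # M-step for p: closed form, the average membership probability.
    p_new = sum(w) / len(w)

    # M-step for each scale: numerical search, capped at 10x the previous value.
    def update(s_prev, weights):
        f = lambda s: weighted_objective(s, x, se, weights)
        candidate = golden_section_min(f, 1e-8, 10.0 * s_prev)
        # Keep the previous value if the search did not improve the objective.
        return candidate if f(candidate) <= f(s_prev) else s_prev

    s1_new = update(s1, w)
    s0_new = update(s0, [1.0 - wj for wj in w])
    return p_new, s0_new, s1_new
```

Because each update never worsens the expected complete-data log-likelihood, the marginal log-likelihood is non-decreasing across iterations, which gives a simple convergence check. Initializing s0 and s1 far apart, as recommended above, helps keep the two components from swapping roles.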
S1.6 Extension to multiple traits at each time point
In this section we propose an extension of the Bayesian framework that allows multiple traits at a single time stage. Specifically, suppose there are K time stages and N traits in total. At each time stage k, there are nk exposures/outcomes of interest, Xk1, …, Xknk, with no internal interactions. Define X1, …, XK as
Let
be the marginal association between Xki and Zj. Define γ1,j, …, γK,j as
Similarly define α1,j, …, αK,j as
Let
Similar to the previous setting, from the individual level equations we have
where
is the matrix of coefficients. Furthermore, since there are no interactions between the Xki’s within each time stage k, the coefficient matrix can be written as
Where
represents the square matrix of zeros,
represents the matrix of marginal associations between the Xk and (Xl)1≤l<k.
Equivalently by defining we can write
where B is of the form
This is a direct conclusion from the fact that
S1.6.1 Adaptation of the Gibbs Sampler
The implementation of the Gibbs sampler for the multivariate version is identical to the previous version except for the updates of B, since we impose additional structure on B. To proceed, we make the following definitions:
Define
for k = 1, 2, …, K.
For any i ∈ {n1 + 1, n1 + 2, …, NK}, define f(i) to be the unique integer k in {1, 2, …, K} such that Nk−1 < i ≤ Nk. Define g(i) = Nf (i).
For any matrix B ∈ ℝ N×N, given n1, …, nk, define
For any v ∈ ℝ N, define
Hence, we have
Then we show how to update B. With prior
Likelihood
and
we have
where
So
The other parts are identical to the previous case.
S2 Additional details of the simulation setup and real data analysis
S2.1 Simulation setup
Our simulation is based on multiple real GWAS summary statistics datasets. For the K = 3 scenario, we select SNPs using data from the GIANT adult BMI and EGG childhood BMI GWAS datasets [3]. The chosen SNPs are then associated with traits including 8-year-old BMI from MoBa, adult BMI from UK Biobank, and Type II diabetes (T2D) from DIAGRAM to generate “true” direct associations with the exposure and outcome traits. Extending to K = 4, the same SNPs and traits are retained, with the 1-year-old BMI GWAS trait from MoBa added as an exposure trait. In the simulation scenario involving multivariate traits at each time point, for SNP selection we employ summary statistics for low-density lipoprotein cholesterol (LDL-C), high-density lipoprotein cholesterol (HDL-C), and triglycerides (TG) from Genetic Epidemiology Research on Adult Health and Aging (GERA) [4]. The simulation comprises three time points: a first stage of childhood traits with LDL-C, HDL-C, and TG summary statistics from [5]; a second stage of adulthood traits with LDL-C, HDL-C, and TG summary statistics from [6]; and a third stage with the stroke trait as the outcome, using summary statistics from [7].
To generate GWAS summary data for simulation, we first select SNPs based on the selection files with LD clumping and a p-value cutoff at 0.01. For each selected SNP j and each exposure k, we record the estimated effects along with their corresponding standard errors
in the exposure and outcome files. Subsequently, we set “true” direct associations
, and we perform a random shuffle of αkj across j within each trait k to ensure independence across traits. Once the matrix A is generated, we create Γ following our linear model with a pre-specified matrix B. In the simulation, the matrix B is typically designed as an approximation of the true relationships among the traits in the real data. Finally, we simulate our synthetic summary statistics
independently across all SNPs and traits. The results of experiments with K = 3 and K = 4 are based on 500 replications, while for K = 7, we conduct 100 replications.
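The data-generating steps above can be sketched as follows. The sketch assumes the summary-data linear model takes the recursive form γkj = αkj + Σl<k Bkl γlj with B strictly lower triangular, and that the synthetic estimates are Gaussian around the true marginal associations with the recorded standard errors; both are stated here as assumptions, and the function names are illustrative.

```python
import random

def marginal_from_direct(A, B):
    """Build marginal associations Gamma from direct associations A.

    A[k][j] : direct association of SNP j with trait k
    B[k][l] : causal effect of trait l on trait k (used only for l < k)
    """
    n_traits, n_snps = len(A), len(A[0])
    G = [[0.0] * n_snps for _ in range(n_traits)]
    for k in range(n_traits):
        for j in range(n_snps):
            # Direct effect plus effects inherited from upstream traits.
            G[k][j] = A[k][j] + sum(B[k][l] * G[l][j] for l in range(k))
    return G

def simulate_summary_stats(A, B, se, rng):
    """Shuffle direct effects within each trait, then draw noisy estimates."""
    shuffled = []
    for row in A:
        row = list(row)
        rng.shuffle(row)  # breaks dependence of direct effects across traits
        shuffled.append(row)
    G = marginal_from_direct(shuffled, B)
    estimates = [[rng.gauss(G[k][j], se[k][j]) for j in range(len(G[k]))]
                 for k in range(len(G))]
    return shuffled, G, estimates
```

Because B is strictly lower triangular, the marginal associations can be filled in trait by trait in temporal order without any matrix inversion.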
At each replication, to mimic real analyses, we run the algorithms with p-value cutoffs for the selection of SNPs. In real analyses, independent selection files are usually used so that the instrumental strength can be computed without selection bias. Here, instead, we directly assign a p-value to each SNP by computing p-values from the true marginal associations and taking the minimum of the p-values across k. Heuristically, this is as if we had independent selection files with no measurement errors.
Specifically, the p-value for each SNP j is computed by
where Φ is the Gaussian cumulative distribution function. Bonferroni correction is then applied to select the significant SNPs under different p-value cutoffs.
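A selection rule of this kind can be sketched as follows. The displayed formula is not reproduced here, so the sketch assumes a two-sided Gaussian p-value 2(1 − Φ(|z|)) with z the ratio of an association to its standard error, a minimum over traits, and a Bonferroni adjustment that divides the cutoff by the number of traits; each of these choices, and every function name, is an assumption made for illustration.

```python
import math

def two_sided_pvalue(z):
    """2 * (1 - Phi(|z|)) for a standard normal, via the complementary error function."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def min_pvalue_across_traits(gamma, se):
    """For each SNP j, the smallest per-trait p-value.

    gamma[k][j], se[k][j] : association and standard error for trait k, SNP j
    """
    n_snps = len(gamma[0])
    return [min(two_sided_pvalue(gamma[k][j] / se[k][j]) for k in range(len(gamma)))
            for j in range(n_snps)]

def select_snps(min_pvalues, cutoff, n_traits):
    """Indices of SNPs passing a Bonferroni-adjusted cutoff (assumed adjustment)."""
    return [j for j, p in enumerate(min_pvalues) if p <= cutoff / n_traits]
```

Using `math.erfc` avoids the loss of precision that computing 1 − Φ(|z|) directly would incur for large |z|.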
S2.2 Calculation of genetic correlation
We apply the LDSC package developed in [8] to compute the genetic correlation between each pair of phenotypes. Here we briefly review the calculation of genetic correlation via LD Score regression [9]. Let Xk and Xl be two phenotypes of interest and S be a set of P SNPs. Define
Define
, the heritability explained by SNPs in S, as
and ρS(Xk, Xl), the genetic covariance among SNPs in S, as
The genetic correlation between Xk and Xl is then defined as
The estimation of genetic correlation involves the estimation of both heritability and genetic covariance. The heritability can be estimated via the single-trait LD Score regression from [8]. The single-trait LD Score regression equation of phenotype Xk is
Where
is the expected χ2-statistic of variant j, Nk is the sample size, P is the number of SNPs, lj is the LD-score, a measures the contribution of confounding biases, and h2 is the average heritability. The cross-trait LD regression can be applied to estimate the genetic covariance. The cross-trait LD Score regression equation is
where zkj denotes the z-score for study k and SNP j, Nk is the sample size for study k, ρg is the genetic covariance, lj is the LD Score, Nkl is the number of individuals included in both studies and ρ is the phenotypic correlation among the overlapping samples. Then the genetic covariance can be estimated by regressing zkjzlj against
.
The choice of weights in LD Score regression is discussed in detail in [9]. The assessment of statistical significance is then conducted by block jackknife.
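The slope-based logic of these regressions can be illustrated with a deliberately simplified sketch. It uses unweighted ordinary least squares on the LD Score alone with constant sample sizes, whereas the LDSC software uses the regression weights of [9] and a block jackknife; the function names are illustrative and this is not the LDSC implementation.

```python
import math

def ols_slope(x, y):
    """Slope of the least-squares line of y on x (with intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def heritability_from_slope(chi2, ld_scores, n_samples, n_snps):
    # Single-trait regression: the slope on the LD Score equals N * h2 / P.
    return ols_slope(ld_scores, chi2) * n_snps / n_samples

def genetic_covariance_from_slope(z_k, z_l, ld_scores, n_k, n_l, n_snps):
    # Cross-trait regression: the slope equals sqrt(N_k * N_l) * rho_g / P.
    products = [a * b for a, b in zip(z_k, z_l)]
    return ols_slope(ld_scores, products) * n_snps / math.sqrt(n_k * n_l)

def genetic_correlation(gen_cov, h2_k, h2_l):
    # Genetic covariance scaled by the geometric mean of the heritabilities.
    return gen_cov / math.sqrt(h2_k * h2_l)
```

In this form the intercepts absorb confounding and sample overlap, so only the slopes carry the heritability and genetic covariance signal.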
S3 Supplemental figures
Simulation results for K = 4. The first row illustrates empirical coverage of 95% confidence intervals of all direct/indirect effects of the exposures on Y over 500 repeated simulations. The second row displays the boxplots of lengths of confidence intervals over repeated simulations.
Simulation results for the multivariate case. a) Empirical coverage of 95% confidence intervals of all direct/indirect effects of the exposures on Y over 500 repeated simulations. b) Boxplots of lengths of confidence intervals over repeated simulations.
Conditional F statistics. a) Conditional F statistics of the exposure traits when we use childhood body size from UK Biobank and adult BMI from the GIANT consortium as exposure GWAS datasets. b) Conditional F statistics of the exposure traits when we use 8-year-old BMI from MoBa and adult BMI from the GIANT consortium as exposure GWAS datasets. Note that the selection files are the same, so both scenarios use the same set of SNPs at any given selection p-value threshold.
Additional results for the breast cancer case study. a) Estimated pairwise genetic correlations across traits. A ‘*’ indicates significantly correlated pairs. b) Estimated noise correlation across traits.
Additional results for the stroke case study. a) MVMR results estimating the joint causal effects of adult BMI and childhood SBP on adult SBP. b) Estimated pairwise genetic correlations across traits. A ‘*’ indicates significantly correlated pairs.
S4 Supplemental tables
Number of SNPs used in simulations and real data analysis at any given selection p-value threshold