Selecting causal risk factors from high-throughput experiments using multivariable Mendelian randomization

Modern high-throughput experiments provide a rich resource to investigate causal determinants of disease risk. Mendelian randomization (MR) is the use of genetic variants as instrumental variables to infer the causal effect of a specific risk factor on an outcome. Multi-variable MR is an extension of the standard MR framework to consider multiple potential risk factors in a single model. However, current implementations of multivariable MR use standard linear regression and hence perform poorly with many risk factors. Here, we propose a novel approach to multivariable MR based on Bayesian model averaging (MR-BMA) that scales to high-throughput experiments and can select biomarker as causal risk factors for disease. In a realistic simulation study we show that MR-BMA can detect true causal risk factors even when the candidate risk factors are highly correlated. We illustrate MR-BMA by analysing publicly-available summarized data on metabolites to prioritise likely causal biomarkers for age-related macular degeneration.


Abstract
Modern high-throughput experiments provide a rich resource to investigate causal determinants of disease risk. Mendelian randomization (MR) is the use of genetic variants as instrumental variables to infer the causal effect of a specific risk factor on an outcome. Multivariable MR is an extension of the standard MR framework to consider multiple potential risk factors in a single model. However, current implementations of multivariable MR use standard linear regression and hence perform poorly with many risk factors.
Here, we propose a novel approach to multivariable MR based on Bayesian model averaging (MR-BMA) that scales to high-throughput experiments and can select biomarker as causal risk factors for disease.
In a realistic simulation study we show that MR-BMA can detect true causal risk factors even when the candidate risk factors are highly correlated. We illustrate MR-BMA by analysing publicly-available summarized data on metabolites to prioritise likely causal biomarkers for age-related macular degeneration.
Mendelian randomization (MR) is the use of genetic variants to infer the presence or absence of a causal effect of a risk factor on an outcome. Under the assumption that the genetic variants are valid instrumental variables, this causal effect can be consistently inferred even in the presence of unobserved confounding factors [1]. The instrumental variable assumptions are illustrated by a directed acyclic graph as shown in Figure 1  data. These biomarkers include proteins [3], blood cell traits [4], metabolites [5] or imaging phenotypes such as cardiac image analysis [6]. Highthroughput experiments provide ideal data resources for conducting MR investigations in conjunction with case-control datasets providing genetic associations with disease outcomes (such as from CARDIoGRAMplusC4D for coronary artery disease [7], DIAGRAM for type 2 diabetes [8], or the International Age-related Macular Degeneration Genomics Consortium [IAMDGC] for age-related macular degeneration [9]). In addition to their untargeted scope, one specific feature of high-throughput experiments is a distinctive correlation pattern between the candidate risk factors shaped by latent biological processes.
Multivariable MR is an extension of standard (univariable) MR that allows multiple risk factors to be modelled at once [10]. Whereas univariable MR makes the assumption that genetic variants specifically influence a single risk factor, multivariable MR makes the assumption that genetic variants influence a set of multiple measured risk factors and thus accounts for measured pleiotropy. Our aim is to use genetic variation in a multivariable MR paradigm to select which risk factors from a set of related and potentially highly correlated candidate risk factors are causal determinants of an outcome. Existing methods for multivariable MR are designed for a small number of risk factors and do not scale to the dimension of high-throughput experiments. We therefore seek to develop a method for multivariable MR that can select and prioritize biomarkers from high-throughput experiments as risk factors for the outcome of interest. In this context we propose a Bayesian model averaging approach (MR-BMA) that scales to the dimension of high-throughput experiments and enables risk factor selection from a large number of candidate risk factors. MR-BMA is formulated on summarized genetic data which is publicely available and allows to maximize the sample size.
To illustrate our approach, we analyse publicly available summarized data from a metabolite genome-wide association study (GWAS) on nearly 25 000 participants to rank and prioritise metabolites as potential biomarkers for age-related macular degeneration. Data are available on genetic associations with 118 circulating metabolites measured by nuclear magnetic resonance (NMR) spectroscopy [11]

Results
Multivariable Mendelian randomizaton and risk factor selection Multivariable MR is an extension of the standard MR paradigm ( Figure   1) to model not one, but multiple risk factors as illustrated in Figure 2, thus accounting for measured pleiotropy. The current implementation of multivariable MR is based on an inverse-variance weighted (IVW) linear regression in a two-sample framework, where the genetic associations with the outcome (sample 1) are regressed on the genetic associations with all the risk factors (sample 2) in a multivariable regression. Weights in these regression models are proportional to the inverse of the variance of the genetic association with the outcome. This is to ensure that genetic variants having more precise association estimates receive more weight in the analysis. The causal effect estimate from the multivariable MR represents the direct causal effects of the risk factors in turn on the outcome when all the other risk factors in the model are held constant [12,13]. However, the current implementation of multivariable MR is not designed to consider a high-dimensional set of risk factors and is not suitable to select biomarkers from high-throughput experiments. Figure 2: Directed acyclic graph of instrumental variable assumptions made in multivariable Mendelian randomization. G = genetic variant(s), X j = risk factor j for j = 1, . . . , d, Y = outcome, U = confounders, θ j = causal effect of risk factor j.
To allow joint analysis of biomarkers from high-throughput experiments in multivariable MR we cast risk factor selection as variable selection in a weighted linear regression model. Formulated in a Bayesian framework (for full details we refer to the Methods section) we use independence priors and closed-form Bayes factors to evaluate the posterior probability (PP ) of specific models (i.e. one risk factor or a combination of multiple risk factors). In high-dimensional variable selection, the evidence for one particular model can be small because the model space is very large and many models might have comparable evidence. This is why MR-BMA uses Bayesian model averaging (BMA) and computes for each risk factor its marginal inclusion probability (MIP ), which is defined as the sum of the posterior probabilities over all models where the risk factor is present. MR-BMA reports the model aver-aged causal effects (MACE ) as the direct causal effect of a risk factor on an outcome. As we show in a simulation study on real biomarker data, MR-BMA enables sparse modeling and hence a better and more stable detection of the true causal risk factors.

Detection of invalid instruments
Invalid instruments may be detected as influential points or outliers with respect to the fit of the linear model. Outliers may arise for a number of reasons, but they are likely to arise if a genetic variant has an effect on the outcome that is not mediated by one or other of the risk factors -an unmeasured pleiotropic effect. To quantify outliers we use the Q-statistic, which is an established tool for identifying heterogeneity in meta-analysis [14]. More precisely, to pinpoint specific genetic variants as outliers we use the contribution q of the variant to the overall Q-statistic, where q is defined as the squared difference between the observed and predicted association with the outcome.
Even if there are no outliers, it is advisable to check for influential observations and re-run the approach omitting that influential variant from the analysis. If a particular genetic variant has a strong association with the outcome, then it may have undue influence on the variable selection, leading to a model that fits that particular observation well, but other observations poorly. To quantify influential observations we suggest to use Cook's distance (Cd) [15]. We illustrate the detection of influential points and outliers in the applied example.

Simulation results on NMR metabolite data
In a simulation study on a realistic data structure based on genetic associations with NMR metabolites [11] we compare the performance to detect true causal risk factors of the existing approach (Multivariable IVW regression), the Lasso [16], a penalised regression approach developed for highdimensional regression models, our novel approach MR-BMA and the best model with the highest posterior probability from the Bayesian model selection. To avoid selection bias, we choose genetic variants based on an external data-set. As the majority of the metabolite measures relates to lipids, we take n = 150 independent genetic variants that are associated with any of three composite lipid measurements (LDL-cholesterol, triglycerides, or HDLcholesterol) at a genome-wide level of significance (p < 5 × 10 −8 ) in a large meta-analysis of the Global Lipids Genetics Consortium [17]. We seek to evaluate two aspects of the methods: 1) how well can the competitors se-  Figure 3 and 4) we find that IVW is the only method that gives unbiased estimates, although at the price of a high variance in the estimates. In contrast, estimates from Lasso with strong penalty are highly biased towards the null, but have very low variance. MR-BMA is a compromise between the strong Lasso penalty and the IVW estimate since it has a weaker bias towards the null than the Lasso but a much reduced variance compared to the IVW estimate.  Supplementary Figures 7 and 8), the variance of the IVW estimates is large and prohibits better performance. The Lasso provides sparse solutions with many of the causal estimates set to zero. This allows the Lasso a relative good performance at the beginning of the ROC curve, but its performance weakens when considering more risk factors. The best performance is terms of the ROC characteristics is observed for MR-BMA. In terms of MSE (Table 1), the dominant role of the variance of the IVW estimate again becomes apparent as the IVW method has a thousand times larger MSE than MR-BMA, which has the lowest MSE for all scenarios considered.

Metabolites as risk factors for age-related macular degeneration
Next we demonstrate how MR-BMA can be used to select metabolites as causal risk factors for age-related macular degeneration (AMD). AMD is a painless eye-disease that ultimately leads to the loss of vision. AMD is highly heritable with an estimated heritability of up to 0.71 for advanced AMD in a twin study [18]. A GWAS meta-analysis has identified 52 independent common and rare variants associated with AMD risk at a level of genomewide significance [9]. Several of these regions are linked to lipids or lipidrelated biology, such as the CETP, LIPC, and APOE gene regions. A recent multivariable MR analysis has shown that HDL-C may be a putative risk factor for AMD, while there was no evidence of a causal effect for LDL-C and triglycerides [19]. Here, we extend this analysis to consider not just three lipid measurements, but a wider range of d = 49 metabolite measurements as measured in the metabolite GWAS described earlier [11] for the same lipidrelated instrumental variants as described previously. First, we prioritise   [19,20].
We repeat the analysis without the three influential and/or heterogeneous variants (n = 145), and report the ten risk factors with the largest marginal inclusion probability in Table 2  These results confirm previous studies [19,20] that identified HDL-C as a putative risk factor for AMD and draw the attention to extra-large and large HDL particles. As a further sensitivity analysis (detailed results not shown), we repeat this analysis with a different selection of instrumental variables using n = 56 independent genetic variants that were genome-wide hits for any metabolite measurement in this dataset [11]. Cholesterol content in large HDL particles is still selected with high posterior probability for this choice

Discussion
We here introduce a novel approach for multivariable MR, MR-BMA, which scales to the analysis of high-throughput experiments. This model averaging procedure prioritises and selects causal risk factors in a Bayesian framework from a high-dimensional set of related candidate risk factors. Our approach is especially suited for sparse settings, i.e. when the proportion of true causal risk factors compared to all risk factors considered is small. We demonstrated the approach with application to a dataset of NMR metabolites, which included predominantly lipid measurements, using variants associated with lipids as instrumental variables. Previous MR analysis [19,20] including three lipid measurements from the Global Lipids Genetics Consortium [17] have identified HDL-C as potential risk factor for AMD. Our new approach to multivariable MR refined this analysis using NMR metabolites as highthroughput risk factor set and confirmed HDL-C as a potential causal risk factor for AMD, further pinpointing that large or extra-large HDL particles are likely to be driving disease risk.
Other areas of application where this method could be used include imaging measurements of the heart and coronary artery disease, body composition measures and type 2 diabetes, or blood cell traits and atherosclerosis.
As multivariable MR accounts for measured pleiotropy, this approach facilitates the selection of suitable genetic variants for causal analyses. In each case, it is likely that genetic predictors of the set of risk factors can be found, even though finding specific predictors of, for example, particular heart measurements from cardiac imaging, may be difficult given widespread pleiotropy [21]. This approach allows a more agnostic and hypothesis-free approach to causal inference, allowing the data to identify the causal risk factors.
Multivariable MR estimates the direct effect of a risk factor on the outcome and not the total effect as estimated in standard univariable MR. This When genetic variants are weak predictors for the risk factors, this can introduce weak instrument bias. In 2-sample MR, any bias due to weak instruments is towards the null and does not lead to inflated type 1 error rates [22]. Consequently, we need to be cautious about the interpretation of null findings, particularly in our example for non-lipid risk factors, as these might be deprioritised in terms of statistical power by our choice of instruments. A further requirement for multivariable MR is that the genetic variants can distinguish between risk factors [13]. We recommend to check the correlation structure between genetic associations for the selected genetic variants and to include no pair of risk factors which is extremely strongly correlated. In the applied example, we included only risk factors with an absolute correlation less than 0.98. As we were not able to include more than three measurements for each lipoprotein category (cholesterol content, triglyceride content, diameter), care should be taken not to overinterpret findings in terms of the specific measurements included in the analysis rather than those correlated measures that were excluded from the analysis (such as phospholipid and cholesterol ester content). In conclusion, we introduce here MR-BMA, the first approach to perform risk factor selection in multivariable MR, which can identify causal risk factors from a high-throughput experiment. MR-BMA can be used to determine which out of a set of related risk factors with common genetic predictors are the causal drivers of disease risk.

Methods
Methods is available online.

Methods
Mendelian Randomization data input: Summarized data set-up One of the key features of Mendelian Randomization (MR) is that the approach can be performed using summarised data on genetic associationsbeta-coefficients and their standard errors from univariate regression analyses. No access to individual-level genotype data is needed. Additionally, these association estimates can be derived from different samples. In twosample MR, the genetic associations with the risk factor are derived from one sample and the genetic associations with the outcome from another sample [1]. The use of summarised data in two-sample MR allows the sample size to be maximised by integrating data from large meta-analyses including hundreds of thousands of participants.
We assume the context of two-sample MR with summarized data [2]. For each genetic variant i = 1, . . . , n and each risk factor j = 1, . . . , d, we take the beta-coefficient β X ij and standard error se(β X ij ) from a univariable regression in which the risk factor X j is regressed on the genetic variant G i in sample one, and beta-coefficient β Y i and standard error se(β Y i ) from a univariable regression in which the outcome Y is regressed on the genetic variant G i in sample two. For simplicity of notation, although the beta-coefficients are estimates, we omit the conventional "hat" notation and treat the betacoefficients as observed data points. When considering multiple risk factors, we construct a matrix of beta-coefficients β X of dimension n × d, where d is the number of risk factors and n is the number of genetic variants.
We assume that the genetic effects on risk factors and on the outcome are linear and homogeneous across the population, and identical between the two samples [3]. Furthermore, we assume that the n genetic variants selected as instrumental variables are independent, an assumption common in MR studies. This is usually achieved by including only the lead genetic variant from each gene region in the analysis. Finally, we assume that genetic association estimates are derived from two distinct samples with no overlap between the samples. These assumptions can all be relaxed to some extent if the goal is causal inference rather than causal estimation; see [4] for details.

Multivariable Mendelian randomisaton and the linear model
Multivariable MR is an extension of the standard MR paradigm (Figure 1) to model not one, but multiple risk factors as illustrated in Figure 2. Univariable MR can be cast as a weighted linear regression model in which the genetic associations with the outcome β Y i are regressed on the genetic associations with the risk factor β X i [5] β In multivariable MR, the genetic associations with the outcome are regressed on the genetic associations with all the risk factors [6] Weights in these regression models are proportional to inverse of the variance of the genetic association with the outcome (se(β Y i ) −2 ). This is to ensure that genetic variants having more precise association estimates receive more weight in the analysis. To account for heterogeneity in this equation, we can use a multiplicative random effects model, which increases the variance of the error terms by a multiplicative factor [7]. The same weighting can also be achieved by standardising the association estimates, by dividing β Y i and β X i by se(β Y i ). In the following derivations, we assume that β Y and β X are standardised, so that the variances of the i terms are all 1. Our parameter of interest is the vector of regression coefficients θ = {θ 1 , ..., θ d }. These are the direct causal effects of the risk factors in turn on the outcome when all the other risk factors in the model are held constant [8]. In contrast, univariable Mendelian randomization using genetic variants that are instrumental variables for the specific risk factor of interest estimates the total effect of the risk factor on the outcome. The direct effect will differ from the total effect if the effect of the risk factor is mediated via another risk factor included in the model [9]. In some cases (such as to identify the proximal risk factor to the outcome), the direct effect is of interest; in other cases (such as to evaluate the potential impact of intervening on a risk factor), it is the total effect that is truly of interest [8].

Choosing genetic variants as instruments
In multivariable MR, a genetic variant is a valid instrumental variable if the following criteria hold: • IV1 Relevance: The variant is associated with at least one of the risk factors.
• IV2 Exchangeability: The variant is independent of all confounders of each of the risk factor-outcome associations.
• IV3 Exclusion restriction: The variant is independent of the outcome conditional on the risk factors and confounders.
One of the main differences of multivariable MR compared to univariable MR is the relaxation of the exclusion restriction condition. In contrast to univariable MR, multivariable MR alllows for measured pleiotropy [10] via any of the observed risk factors. It is not necessary for every genetic variant to be associated with all the risk factors, although if no genetic variants are associated with a particular risk factor, then the causal effect of that risk factor cannot be identified. This would also occur if the genetic associations with two risk factors were exactly proportional. For precise identification of causal risk factors, it is necessary to have some variants that are more strongly associated with particular risk factors than others [9].
We initially assume that all genetic variants are valid instruments. There is an emerging literature [11,12] on how to perform robust MR analysis in the presence of invalid instruments; similar extensions can be adapted for multivariable MR [10].

Risk factor selection as variable selection in the linear model
We consider the situation in which we have a set of genetic variants that are instrumental variables for a set of risk factors, and we want to select which of those risk factors are causes of the outcome. Our implicit prior belief is that not all of the risk factors are causally related to the outcome and that the set of true causal risk factors is sparse. We formulate the selection of risk factors in two-sample multivariable MR as a variable selection task in the linear regression framework. In order to model the correlation between risk factors we base our likelihood on a Gaussian distribution Following the D2 prior specifications as introduced in [13], we use the following conjugate priors for the causal effects θ, the residual error , and the precision τ where ν = diag(σ 2 ) is the diagonal variance matrix of the causal effects (independence prior), and the precision τ is assumed to follow a Gamma distribution with hyperparameters κ as the shape and λ as the scale parameter.
Next, we introduce a binary indicator γ of length d that indicates which risk factors are selected and which ones are not The where Θ = Ωβ t Xγ β Y is the causal effect estimate and Ω = ( is the inverse of the shrinkage covariance between the genetic associations of the risk factors. For a detailed derivation of the Bayes factor we refer to the Supplementary Note S1.

Prior specification
Another important aspect is the prior for the model size k, which we model using a Binomial distribution This requires choosing the probability p of including a risk factor in the model according to prior assumptions regarding the sparsity of the results.
We recommend to select p according to the expected a priori model size, which is p × d. Currently, all risk factors are assumed to have the same prior probability, and thus the probability of all models of the same size k is equal.
The prior of a specific model M γ of size k is defined as The second important aspect is the prior for the variance of the risk factors ν = diag(σ 2 ), where we assume that all risk factors have the same prior variance σ 2 . Following [13] we set σ 2 = 0.25, but sensitivity of the results with respect to this prior should be investigated.
In high-dimensional variable selection, the evidence for one particular model can be small because the model space is very large and many models might have comparable evidence. This is why MR-BMA uses Bayesian model averaging (BMA) and computes for each risk factor j its marginal inclusion probability (MIP ), which is defined as the sum of the posterior probabilities over all models where the risk factor is present where I(γ j = 1) equals 1 if risk factor j is part of the model and 0 otherwise.

Causal estimation
We derive the estimates for the causal effectsθ γ of model M γ aŝ and the model-averaged causal estimate (MACE) for risk factor j from the MR-BMA approach aŝ MR-BMA ranks and prioritises risk factors according to their marginal inclusion probability and estimates the MACE as defined in equation (12).
As an alternative approach, we also consider selecting the 'best model' based on the individual model posterior probabilities as defined in equation (9).

Detection of invalid instruments
Invalid instruments may be detected as influential points or outliers with respect to the fit of a specific linear model M γ . We recommend to check the best individual models for outliers by visual inspection of the scatterplot of the predicted associations based on M γ with the outcomeβ Y = β Xγθγ against the actual observed observations β Y . If a genetic variant is detected consistently as an outlier in several of the top models, it may be advisable to explore the analyses excluding that outlying variant from the analysis.
To quantify outliers we use the Q-statistic, which is an established tool for identifying heterogeneity in meta-analysis [15]. It is defined as the sum of the residual vector q, which is the squared difference between the observed and predicted association with the outcome where h i is the ith diagonal element of the hat matrix H = β Xγ (ν −1 γ + β t Xγ β Xγ ) −1 β t Xγ , and s 2 = 1 n−d t is the mean squared error of the regression model.

Simulation study on metabolite GWAS
To evaluate the performance of MR-BMA, we perform a simulation study using publicly-available summarized data on genetic associations with risk factors derived from a recent metabolite GWAS [17] as introduced earlier.
All of the metabolites were inverse rank-based normal transformed, so the association estimates are all in standard deviation units.
In order to avoid selection bias, we choose genetic variants based on an external data-set. As the majority of the metabolite measures relates to lipids, we take n = 150 independent genetic variants that are associated with any of three composite lipid measurements (LDL-cholesterol, triglycerides, or HDL-cholesterol) at a genome-wide level of significance (p < 5 × 10 −8 ) in a large meta-analysis of the Global Lipids Genetics Consortium [18]. We extract beta-coefficients and standard errors of genetic associations for the 150 genetic variants and the 118 available metabolites. Next, we compute the genetic correlation structure between metabolites based on the n = 150 instrumental variables and exclude at random one of each pair of metabolites that are in stronger correlation than |r| > 0.99. Our final data-set β X for the simulation study comprises associations of d = 92 metabolites measured on n = 150 genetic variants. This allows us to investigate risk factor selection for a realistic genetic correlation structure between metabolites and distribution of the regression coefficients.
After taking genetic associations with the risk factors from the real dataset, we simulate genetic associations with the outcome β Y based on a subset of risk factors, which we refer to as the 'true' risk factors. We investigate the following 12 different scenarios: • Size of the data set: moderate (d = 12 metabolites selected at random) and large (d = 92 all metabolites available) We compare four different analysis methods: • Multivariable inverse variance weighted (IVW) regression (equation 2) [19] • Lasso as regularised regression [20] • MR-BMA using marginal inclusion probabilities • Bayesian best model selection using posterior probabilities of individual models Lasso is a L1 regularised linear regression method which has been devised for variable selection in high-dimensional data. The regularisation parameter of Lasso is set to 0.1 (min, strong penalty) or 0.9 (max, weak penalty), where a penalty equal to 1 reflects the unpenalised IVW regression. Additionally, we use cross-validation (CV) to determine the penalty parameter. For the moderate risk factor space including 12 metabolites, the MR-BMA approach is performed using an exhaustive search of all possible models with a prior probability of a risk factor to be included set to p = 0.5, while for the large risk factor space we employ the stochastic search with 10,000 iterations and p = 0.1. This reflects an expected a priori model size of six for the moderate risk factor space and around nine for the large risk factor space. The prior variance σ 2 is fixed to 0.25.
Data pre-processing and analysis for applied example of age-related macular degeneration In the applied example we demonstrate how MR-BMA can be used to select metabolites as causal risk factors for age-related macular degeneration (AMD). As risk factors we consider a range of circulating metabolites measured by nuclear magnetic resonance (NMR) spectroscopy [17] from http://computationalmedicine.fi/data#NMR_GWAS and we use the same lipid-related instrumental variants as described previously. We restrict the risk factor space to include only lipoprotein measurements on total cholesterol content, triglyceride content, and particle diameter; for the various fatty acid measurements we only included total fatty acids. This results in none of the d = 49 metabolite measures having correlations in their genetic associations of |r| > 0.98 (Supplementary Figure 9). Genetic associations with the outcome are taken from the latest large-scale GWAS metaanalysis on AMD [21] including 16, 144 patients and 17, 832 controls which is available from http://csg.sph.umich.edu/abecasis/public/amd2015/.
To synchronise the genetic data on the metabolite risk factors and the AMD outcome, we match the effect alleles and we remove two genetic variants missing in the AMD data, so that the overall analysis includes n = 148 variants.
Finally, we use the Ensembl Variant Effect Predictor [22] to annotate the genetic variants to the gene that is most likely affected.
We run MR-BMA including all n = 148 available genetic variants on the  Table 8).

Web resources
MR-BMA and publicly available summary data on AMD and NMR metabolites as presented in the applied example is public on https://github.com/ verena-zuber/demo_AMD