bioRxiv
New Results

Non-linear randomized Haseman-Elston regression for estimation of gene-environment heritability

Matthew Kerin, Jonathan Marchini
doi: https://doi.org/10.1101/2020.05.18.098459
Matthew Kerin
1Wellcome Trust Center for Human Genetics, Oxford, UK
Jonathan Marchini
2Regeneron Genetics Center, Tarrytown, USA
For correspondence: jonathan.marchini@regeneron.com

Abstract

Gene-environment (GxE) interactions are one of the least studied aspects of the genetic architecture of human traits and diseases. The environment of an individual is inherently high dimensional, evolves through time and can be expensive and time consuming to measure. The UK Biobank study, with all 500,000 participants having undergone an extensive baseline questionnaire, represents a unique opportunity to assess GxE heritability for many traits and diseases in a well powered setting. We have developed a non-linear randomized Haseman-Elston (RHE) regression method applicable when many environmental variables have been measured on each individual. The method (GPLEMMA) simultaneously estimates a linear environmental score (ES) and its GxE heritability. We compare the method via simulation to a whole-genome regression approach (LEMMA) for estimating GxE heritability. We show that GPLEMMA is computationally efficient and produces results highly correlated with those from LEMMA when applied to simulated data and real data from the UK Biobank.

Introduction

The advent of genome-wide association studies1 has catalyzed a huge number of discoveries linking genetic markers to many human complex diseases and traits. For the most part these discoveries have involved common variants that confer relatively small amounts of risk and only account for a small proportion of the phenotypic variance of a trait2. This has led to a surge of interest in methods and applications that measure the joint contribution to phenotypic variance of all measured variants throughout the genome (SNP heritability), and in testing individual variants within this framework. Most notably, the seminal paper of Yang et al. (2010) used a linear mixed model (LMM) to show that the majority of missing heritability for height could be explained by genetic variation at common SNPs3. When testing variants for association these LMMs can reduce false positive associations due to population structure, and improve power by implicitly conditioning on other loci across the genome4–6. These methods model the unobserved polygenic contribution as a multivariate Gaussian with covariance structure proportional to a genetic relationship matrix (GRM) 7–9. This approach is mathematically equivalent to a whole genome regression (WGR) model with a Gaussian prior over SNP effects 4.

Subsequent research has shown that the simplest LMMs make assumptions about the relationship between minor allele frequency (MAF), linkage disequilibrium (LD) and trait architecture that may not hold up in practice10, 11, and generalisations have been proposed that stratify variance into different components by MAF and LD10, 12, 13. Other flexible approaches have been proposed in both the animal breeding14, 15 and human literature16–18 to allow different prior distributions that better capture SNPs of small and large effects. For example, a mixture of Gaussians (MoG) prior can increase power to detect associated loci in some (but not all) complex traits6, 17. Other methods have been proposed that estimate heritability only from summary statistics and LD reference panels 19, 20. Heritability can also be estimated using Haseman-Elston regression 21, which has recently been extended using a randomised approach 22 that has Embedded Image computational complexity and works for multiple variance components 23. Other recent work has shown that LMM approaches such as these are not able to disentangle direct and indirect genetic effects, the balance of which will vary depending on the trait being studied 24.

There has been less exploration of methods for estimating heritability that account for gene-environment interactions. One interesting approach has proposed using spatial location as a surrogate for environment25 using a three component LMM - one based on genomic variants, one based on measured spatial location as a proxy for environmental effects, and a gene-environment component, modeled as the Hadamard product of the genomic and spatial covariance matrices. Other authors have used this method to account for gene-gene interactions 26, 27.

Modelling gene-environment interactions when many different environmental variables are measured is a more challenging problem. If several environmental variables drive interactions at individual loci, or if an unobserved environment that drives interactions is better reflected by a combination of observed environments, it can make sense to include all variables in a joint model. StructLMM28 focuses on detecting GxE interactions at individual markers, and models the environmental similarity between individuals (over multiple environments) as a random effect, and then tests each SNP independently for GxE interactions. However this approach does not model the genome wide contribution of all the markers, which is often a major component of phenotypic variance.

We recently proposed a whole genome regression approach called LEMMA applicable to large human datasets such as UK Biobank, where many potential environmental variables are available29. The LEMMA regression model includes main effects of each genotyped SNP across the genome, and also interactions of each SNP with an environmental score (ES) that is a linear combination of the environmental variables. The ES is estimated as part of the method. The model uses mixture of Gaussian (MoG) priors on main and GxE SNP effects, which allow for a range of different genetic architectures from polygenic to sparse genetic effects16–18. The ES can be readily interpreted and its main use is to test for GxE interactions one variant at a time, typically at a larger set of imputed SNPs in the dataset. However, the ES can also be used to estimate the proportion of phenotypic variability that is explained by GxE interactions (SNP GxE heritability), using a two component randomised Haseman-Elston (RHE) regression 23.

The main contribution of this paper is to combine the estimation of the LEMMA ES into a standalone RHE framework. This results in a non-linear optimization problem that we solve using the Levenberg-Marquardt (LM) algorithm. The method implicitly assumes a Gaussian prior on main effect and GxE effect sizes. We also propose a separate RHE method that estimates the independent GxE contribution of each measured environmental variable. We set out the differences between these two models and present a simulation study to compare them to LEMMA. We also apply the method to UK Biobank data and show that GPLEMMA produces estimates very close to LEMMA. Software implementing the GPLEMMA algorithm in C++ is available at https://jmarchini.org/gplemma/.

Methods

Modeling SNP Heritability

The simplest model for estimating SNP heritability has the form $y = X\beta + e$, with $\beta \sim N(0, \frac{\sigma_g^2}{M} I_M)$ and $e \sim N(0, \sigma_e^2 I_N)$, where y is a continuous phenotype, X is an N × M matrix of genotypes that has been normalised to have column mean zero and column variance one, and β is an M-vector of SNP effect sizes. Integrating out β leads to the variance component model $y \sim N(0, \sigma_g^2 K + \sigma_e^2 I_N)$, where $K = X X^T / M$ is known as the genomic relationship matrix (GRM)3. Estimating the two parameters in this model, $\sigma_g^2$ and $\sigma_e^2$, leads to an estimate of SNP heritability of $\hat{h}^2 = \hat\sigma_g^2 / (\hat\sigma_g^2 + \hat\sigma_e^2)$. This is commonly referred to in the literature as the single component model. Subsequent research has shown that the single component model makes assumptions about the relationship between minor allele frequency (MAF), linkage disequilibrium (LD) and trait architecture that may not hold up in practice10, 11. There have been many follow-up methods, including: generalizations that stratify variance into different components by MAF and LD 13, approaches that assign different weights for the GRM10, 12, methods that replace the Gaussian prior on β with a spike and slab on SNP effect sizes 30, and methods that estimate heritability only from summary statistics and LD reference panels 20, 31.

Haseman-Elston (HE) regression

An alternative method used to compute heritability is known as HE-regression21. HE-regression is a method of moments (MoM) estimator that optimizes variance components $(\sigma_g^2, \sigma_e^2)$ in order to minimise the squared difference between the observed and expected trait covariances. The MoM estimator $(\hat\sigma_g^2, \hat\sigma_e^2)$ can be obtained by solving the minimization $$\arg\min_{\sigma_g^2, \sigma_e^2} \lVert y y^T - (\sigma_g^2 K + \sigma_e^2 I_N) \rVert_F^2$$ or equivalently by solving the linear regression problem $$\mathrm{vec}(y y^T) = \sigma_g^2\, \mathrm{vec}(K) + \sigma_e^2\, \mathrm{vec}(I_N) + \epsilon$$ where vec(A) is the vectorization operator that transforms an N × M matrix into an NM-vector.

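In matrix format, both of these forms correspond to solving the following linear system $$\begin{pmatrix} \mathrm{tr}(K^2) & \mathrm{tr}(K) \\ \mathrm{tr}(K) & N \end{pmatrix} \begin{pmatrix} \hat\sigma_g^2 \\ \hat\sigma_e^2 \end{pmatrix} = \begin{pmatrix} y^T K y \\ y^T y \end{pmatrix} \qquad (1)$$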
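As a minimal sketch of single-component HE-regression, the following solves the 2 × 2 normal equations obtained by regressing the entries of yyᵀ on the entries of K and of the identity matrix (left-hand side built from tr(K²), tr(K) and N, right-hand side from yᵀKy and yᵀy). The function names `he_regression` and `snp_heritability` are hypothetical, K is formed explicitly for clarity, and this is an illustration rather than the authors' implementation.

```python
def he_regression(K, y):
    """Method-of-moments estimates (sigma_g2, sigma_e2) for one variance component.

    K is an N x N list-of-lists relationship matrix and y an N-vector.
    Solves [[tr(K^2), tr(K)], [tr(K), N]] @ [sg2, se2] = [y'Ky, y'y].
    """
    n = len(y)
    tr_k = sum(K[i][i] for i in range(n))
    tr_k2 = sum(K[i][j] * K[j][i] for i in range(n) for j in range(n))
    yky = sum(y[i] * K[i][j] * y[j] for i in range(n) for j in range(n))
    yy = sum(v * v for v in y)
    det = tr_k2 * n - tr_k * tr_k          # determinant of the 2 x 2 system
    sigma_g2 = (n * yky - tr_k * yy) / det
    sigma_e2 = (tr_k2 * yy - tr_k * yky) / det
    return sigma_g2, sigma_e2


def snp_heritability(sigma_g2, sigma_e2):
    """Convert the two variance components into a heritability estimate."""
    return sigma_g2 / (sigma_g2 + sigma_e2)
```

Forming K costs O(N²M) time and O(N²) memory, which is exactly what the randomized approach described next avoids.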

HE-regression methods are widely acknowledged to be more computationally efficient22, 32, 33 and do not require any assumptions on the phenotype distribution beyond the covariance structure 32 (in contrast to maximum-likelihood estimators). However, HE-regression based estimates typically have higher variance 33, thus implying that they have less power.

Recent method developments22, 23 have shown that a randomized HE-regression (RHE) approach can be used to compute efficiently on genetic datasets with hundreds of thousands of samples.

Wu et al. (2018) observed that Equation (1) can be solved efficiently without ever having to explicitly compute the kinship matrix K by using Hutchinson’s estimator 34, which states that $\mathrm{tr}(A) = E[z^T A z]$ for any square matrix A, where z is a random vector with mean zero and covariance given by the identity matrix. The proposed method involves approximating tr(K) and tr(K²) using only matrix-vector multiplications with the genotype matrix X, to compute the following expressions $$\mathrm{tr}(K) \approx \frac{1}{BM} \sum_{b=1}^{B} \lVert X^T z_b \rVert_2^2, \qquad \mathrm{tr}(K^2) \approx \frac{1}{BM^2} \sum_{b=1}^{B} \lVert X X^T z_b \rVert_2^2$$
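The trace approximation can be sketched in a few lines. The example below assumes K = XXᵀ/M and standard Gaussian random vectors; `k_matvec` and `hutchinson_traces` are hypothetical names, and the pure-Python loops stand in for the streamed matrix-vector products a real implementation would use.

```python
import random


def k_matvec(X, z):
    """Return K z = X (X^T z) / M without forming the N x N matrix K."""
    n, m = len(X), len(X[0])
    xtz = [sum(X[i][j] * z[i] for i in range(n)) for j in range(m)]
    return [sum(X[i][j] * xtz[j] for j in range(m)) / m for i in range(n)]


def hutchinson_traces(X, n_samples, seed=0):
    """Monte Carlo estimates of tr(K) and tr(K^2) from n_samples random vectors."""
    rng = random.Random(seed)
    n = len(X)
    tr_k = 0.0
    tr_k2 = 0.0
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        kz = k_matvec(X, z)
        tr_k += sum(a * b for a, b in zip(z, kz))   # z' K z
        tr_k2 += sum(v * v for v in kz)             # z' K K z = ||K z||^2
    return tr_k / n_samples, tr_k2 / n_samples
```

Each random vector requires two passes over X, so the cost grows with the number of samples times the size of X, and no N × N matrix is ever stored.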

Thus an approximate solution can be obtained in $O(NMB)$ time, where B denotes a relatively small number of random samples. Subsequent work extended this approach to a multiple component model 23 $$y \sim N\Big(0, \sum_k \sigma_k^2 K_k + \sigma_e^2 I_N\Big)$$ with parameter estimates obtained as the solution to the linear system given by $$\begin{pmatrix} T & b \\ b^T & N \end{pmatrix} \begin{pmatrix} \sigma^2 \\ \sigma_e^2 \end{pmatrix} = \begin{pmatrix} c \\ y^T y \end{pmatrix}$$ where Tkl = tr(KkKl), bk = tr(Kk) and ck = yTKky. Finally, both papers show how to efficiently control for covariates by projecting them out of all terms in the system of equations. Thus with covariates included the multiple component model becomes $$y \sim N\Big(C\alpha, \sum_k \sigma_k^2 K_k + \sigma_e^2 I_N\Big)$$ and terms in the subsequent linear system are given by Tkl = tr(WKkWKlW), bk = tr(WKkW) and ck = yTWKkWy, where W = IN − C(CTC)-1CT.
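Projecting out covariates never requires forming the N × N matrix W. As a sketch, the residual Wv for W = I − C(CᵀC)⁻¹Cᵀ can be obtained by solving a small D × D system; the helper names `solve` and `project_out` are hypothetical, and the dense Gaussian elimination is only meant for small D.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def project_out(C, v):
    """Return W v with W = I - C (C^T C)^{-1} C^T, for an N x D covariate matrix C."""
    n, d = len(C), len(C[0])
    ctc = [[sum(C[i][a] * C[i][b] for i in range(n)) for b in range(d)]
           for a in range(d)]
    ctv = [sum(C[i][a] * v[i] for i in range(n)) for a in range(d)]
    coef = solve(ctc, ctv)                      # (C^T C)^{-1} C^T v
    return [v[i] - sum(C[i][a] * coef[a] for a in range(d)) for i in range(n)]
```

Applying `project_out` to y and to each random vector before and after the kernel products yields the projected quantities such as yᵀWKₖWy.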

Modeling GxE heritability

We introduce two extensions of the RHE framework for modelling GxE interactions with multiple environmental variables. In both models we let E be an N × L matrix of environmental variables and C be an N × D matrix of covariates, each with columns normalised to have mean zero and variance one.

MEMMA

The first model assumes that each environmental variable interacts independently with the genome Embedded Image where Embedded Image and ElʘX denotes the element-wise product of El with each column of X. Integrating out β and λ leads to the variance component model Embedded Image where Embedded Image and Embedded Image
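To illustrate how an interaction kernel can be handled with matrix-vector products alone, the sketch below assumes the kernel for environment l has the form Kₗ = (Eₗ ⊙ X)(Eₗ ⊙ X)ᵀ/M. That form is inferred from the description of Eₗ ⊙ X above, not taken from the displayed equations. Under this assumption Kₗz = Eₗ ⊙ (K(Eₗ ⊙ z)) with K = XXᵀ/M, so no new matrix needs to be stored. The names `k_matvec` and `gxe_matvec` are hypothetical.

```python
def k_matvec(X, z):
    """Return K z = X (X^T z) / M without forming the N x N matrix K."""
    n, m = len(X), len(X[0])
    xtz = [sum(X[i][j] * z[i] for i in range(n)) for j in range(m)]
    return [sum(X[i][j] * xtz[j] for j in range(m)) / m for i in range(n)]


def gxe_matvec(X, e, z):
    """Return K_l z for the assumed kernel K_l = (e * X)(e * X)^T / M.

    e is one environmental variable (an N-vector); the product is computed
    as e * (K (e * z)), reusing the main-effect kernel product.
    """
    ez = [ei * zi for ei, zi in zip(e, z)]
    kez = k_matvec(X, ez)
    return [ei * v for ei, v in zip(e, kez)]
```

Passing an environmental score in place of a single environment gives the analogous product for a kernel built from a linear combination of environments.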

Fitting the variance components is done analytically by solving the system of equations Tθ = c where Tkl = tr(WKkWKlW), ck = yTWKkWy and W = IN − C(CTC)-1CT. As shown in the original RHE method22, 23, Hutchinson’s estimator can be used to efficiently estimate Tkl. To do this our software streams SNP markers from a file and computes yTWXXTWy and the following N-vectors Embedded Image Embedded Image where zb ~ N(0, IN) for 1 ≤ b ≤ B are random N-vectors. Then Embedded Image where Embedded Image is defined as Embedded Image

Finally, the variance components are converted to heritability estimates using the following formula Embedded Image

We call this approach MEMMA (Multiple Environment Mixed Model Analysis). MEMMA costs Embedded Image in compute and Embedded Image in RAM.

GPLEMMA

The second model involves estimating a linear combination of environments, or environmental score (ES), that interacts with the genome. The model is given by Embedded Image where η = Ew is the linear environmental score (ES), Embedded Image and Embedded Image. This is the same model used by LEMMA29 except that the mixture of Gaussians priors on SNP effects (β and γ) have been replaced with Gaussian priors. For this reason we call this approach GPLEMMA (Gaussian Prior Linear Environment Mixed Model Analysis). Integrating out the SNP effects yields the model Embedded Image where Embedded Image and Fl = El ʘ X. Minimizing the squared loss between the expected and observed covariance is equivalent to the following regression problem Embedded Image

In this format it is clear that optimising Embedded Image is a non-linear regression problem. Further, including a parameter for Embedded Image is no longer necessary. From here on we set Embedded Image and drop the Embedded Image parameterisation without loss of generality.
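One way to see why a separate GxE variance parameter becomes redundant is a scaling argument. It is sketched here under the assumption that the GxE covariance has the form diag(Ew) K diag(Ew) with K = XXᵀ/M; the symbol K_{GxE}(w) is introduced for illustration only and is not the paper's notation.

```latex
K_{GxE}(w) = \mathrm{diag}(Ew)\, K\, \mathrm{diag}(Ew),
\qquad
K_{GxE}(c\,w) = c^{2}\, K_{GxE}(w) \quad \text{for any scalar } c,
\\[4pt]
\Rightarrow \quad
\sigma_{GxE}^{2}\, K_{GxE}(w) = K_{GxE}(\sigma_{GxE}\, w).
```

Any value of the variance parameter can therefore be absorbed into the scale of the weight vector w, so only the product is identifiable.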

Levenberg-Marquardt algorithm

We use the Levenberg-Marquardt (LM) algorithm35, which is commonly used for non-linear least squares problems. The algorithm effectively interpolates between the Gauss-Newton algorithm and the method of steepest gradient descent, by use of an adaptive damping parameter. In this manner, it is more robust than the straightforward Gauss-Newton algorithm but should have faster convergence than a gradient descent approach.

Without loss of generality, consider the model Embedded Image where f(θ) is a function that is non-linear in the parameters θ. Given a starting point θ0, LM proposes a new point θnew = θ0 + δ by solving the normal equations Embedded Image where Embedded Image and ϵ(θ0) = Y − f(θ0) are respectively the Jacobian and the residual vector evaluated at θ0.

If θnew has lower squared error than θ0, then the step is accepted and the adaptive damping parameter μ is reduced. Otherwise, μ is increased and a new step δ is proposed. For small values of μ Equation (9) approximates the quadratic step appropriate for a fully linear problem, whereas for large values of μ Equation (9) behaves more like steepest gradient descent. This allows the algorithm to defensively navigate regions of the parameter space where the model is highly non-linear.
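The accept/reject logic can be sketched on a generic non-linear least squares problem. The damping form used here, (JᵀJ + μ diag(JᵀJ))δ = Jᵀϵ, is one common variant and is an assumption, as are the factor of 10 used to adapt μ and the names `solve` and `levenberg_marquardt`. The sketch is a toy illustration, not the GPLEMMA implementation.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def levenberg_marquardt(f, jac, y, theta0, mu=1e-2, max_iter=200, tol=1e-12):
    """Minimise ||y - f(theta)||^2; returns (theta, squared_error).

    f(theta) returns the fitted values, jac(theta) the len(y) x P Jacobian.
    """
    theta = list(theta0)
    resid = [yi - fi for yi, fi in zip(y, f(theta))]
    sse = sum(r * r for r in resid)
    p = len(theta)
    for _ in range(max_iter):
        J = jac(theta)
        JtJ = [[sum(row[a] * row[b] for row in J) for b in range(p)]
               for a in range(p)]
        Jte = [sum(row[a] * r for row, r in zip(J, resid)) for a in range(p)]
        # Damped normal equations (assumed form): scale the diagonal by (1 + mu).
        A = [[JtJ[a][b] + (mu * JtJ[a][a] if a == b else 0.0)
              for b in range(p)] for a in range(p)]
        delta = solve(A, Jte)
        cand = [t + d for t, d in zip(theta, delta)]
        cand_resid = [yi - fi for yi, fi in zip(y, f(cand))]
        cand_sse = sum(r * r for r in cand_resid)
        if cand_sse < sse:      # accept the step and trust the model more
            theta, resid, sse = cand, cand_resid, cand_sse
            mu /= 10.0
        else:                   # reject the step and move towards gradient descent
            mu *= 10.0
        if sse < tol or mu > 1e12:
            break
    return theta, sse
```

Because a step is only accepted when it lowers the squared error, the error sequence is non-increasing regardless of how non-linear the model is.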

In summary the LM algorithm requires computation of the matrices J(θ)TJ(θ), J(θ)Tϵ(θ) at each step, as well as the squared error (which we define as S(θ)). We now give statements of the equations used to compute each of these values, and show that each iteration can be performed in Embedded Image time.

We apply the LM algorithm with Embedded Image and Embedded Image

Several quantities can be pre-calculated and re-used in the LM algorithm. The N-vectors ub, vb,l, and yTWXXTWy are needed and have been defined above. In addition, GPLEMMA also benefits from the pre-calculation of Embedded Image which can also be computed as genotypes are streamed from file.

Let (JT J)θi,θj denote the entry of JTJ that corresponds to Embedded Image for Embedded Image and define the N-vector vb(w) = ∑l wl vb,l. Then the (L + 2) × (L + 2) matrix J(θ)TJ(θ) is given by Embedded Image J(θ)Tϵ(θ) is given by Embedded Image where Embedded Image

Finally the squared error, which we define as S(θ), is given by Embedded Image where Embedded Image

The initial preprocessing step costs Embedded Image in compute and Embedded Image in RAM. The remaining algorithm does not require much RAM in addition to that required in the preprocessing step, so also costs Embedded Image in RAM. Construction of the summary variable vb(w) = ∑l wl vb,l costs Embedded Image in compute. Each iteration of the LM algorithm costs Embedded Image.

It is possible to parallelise GPLEMMA using OpenMPI by partitioning samples across cores, in a similar manner to that used by LEMMA 29. Given that evaluating the objective function Embedded Image is characterised by BLAS level 1 array operations, a distributed algorithm using OpenMPI should have superior runtime to the same algorithm using OpenMP, as well as providing RAM limited only by the size of a researcher's compute cluster.

We perform 10 repeats of the LM algorithm with different initialisations, and keep results from the solution with lowest squared error Embedded Image. Each run is initialised with a vector of interaction weights Embedded Image, where each entry is set to Embedded Image and a small amount of Gaussian noise is added. Embedded Image

To transform the initial weights vector Embedded Image to the initial parameters θ0 we let Embedded Image be solutions to Embedded Image

The GPLEMMA algorithm is then initialized with Embedded Image where Embedded Image.

Relationship between MEMMA and GPLEMMA

Comparing Equation (3) with Equation (6) suggests that the GPLEMMA model can be expressed as the MEMMA model with the added constraint that Embedded Image where Λ = {λ1,..., λM} is the L × M matrix of GxE effect sizes in MEMMA for the L environments and M SNPs.

We can expect the two models to give similar heritability estimates, under the simplifying assumptions that GxE interactions do occur with a single linear combination of the environments and that the set of random variables Embedded Image is mutually independent. Let Embedded Image and Embedded Image. Then the connection between the two models is revealed by observing Embedded Image where Embedded Image. Thus we should expect both models to have the same estimate for the proportion of variance explained by GxE interaction effects.

Even in the case that MEMMA and GPLEMMA have the same expected heritability estimate, there are still some differences between the two. GPLEMMA is a constrained model, so the variance of its heritability estimates may be smaller. Further, although Embedded Image is proportional to the square of the weights used to construct the ES, the sign of the interaction weight wl has been lost. Thus it is not possible to reconstruct an ES for use in single SNP testing using MEMMA.

Results

Simulated data

We carried out a simulation study to assess the relative properties of MEMMA and GPLEMMA. In addition, we compared to running the whole genome regression model in LEMMA, which estimates an ES and then uses it to estimate the GxE heritability.

The simulations use real data subsampled from genotyped SNPs in the UK Biobank36, drawing SNPs from all 22 chromosomes in proportion to chromosome length and using unrelated samples of mixed ancestry (N = 25k: 12500 white British, 7500 Irish and 5000 white European; N = 50k: 29567 white British, 7500 Irish and 12568 white European; N = 100k: 79567 white British, 7500 Irish and 12568 white European; using self-reported ancestry in field f.21000.0.0). All samples were genotyped using the UKBB genotype chip and were included in the internal principal component analysis performed by the UK Biobank. Environmental variables were simulated from a standard Gaussian distribution.

Phenotypes for the baseline simulations were all simulated according to the LEMMA model 29. Let N be the number of individuals, M the total number of SNPs, Mg the number of causal main effect SNPs, MGxE the number of SNPs with GxE effects, L the total number of environmental variables, Lactive the number of ’active’ environments with non-zero contribution to the ES vector w, and Embedded Image and Embedded Image the heritability of main effects and GxE effects. The model used to simulate data is Embedded Image where X represents the N × M genotype matrix after columns have been standardised to have mean zero and variance one, C is the first principal component of the genotype matrix and E is the N × L matrix of environmental variables. In all simulations α was set such that Cα explained one percent of trait variance.

Non-zero elements of the interaction weight vector w were drawn from a decreasing sequence Embedded Image

The effect size parameters β and γ were simulated from a spike and slab prior such that the number of non-zero elements was governed by Mg and MGxE for main and interaction effects respectively. Non-zero elements were drawn from a standard Gaussian, and then subsequently rescaled to ensure that the heritability given by main and interaction effects was Embedded Image and Embedded Image respectively. We chose a set of baseline parameter choices: N = 25K; M = 100K; L = 30; Lactive = 6, Mg = 2500; MGxE = 1250; Embedded Image; Embedded Image, and then varied one parameter at a time to examine the effects of sample size, number of environments, number of non-zero SNP effects and GxE heritability. In addition, we investigated performance using a larger baseline simulation with N = 100K samples and M = 300K variants. The first genetic principal component was provided as a covariate to all methods.

Figure 1 compares estimates of the percentage variance explained (PVE) by GxE effects from all three methods. In general, all methods had upwards bias that decreased with sample size and increased with the number of environments. While heritability estimates from LEMMA and GPLEMMA appeared quite similar, estimates from MEMMA had much higher variance and also appeared to have higher upwards bias as the total number of environments increased. All the methods exhibited less bias in the larger simulations with N = 100K samples and M = 300K variants (Figure 1 (e-g)).

Figure 1: PVE estimation.

Estimates of the proportion of variance explained by GxE effects by LEMMA, MEMMA and GPLEMMA on baseline simulations using N = 25K samples and M = 100K variants, whilst varying sample size (a), the number of environments (b), the number of non-zero SNP effects (c) and GxE heritability (d). Panels (e-g) show results of simulations with N = 100K samples and M = 300K variants, whilst varying the number of environments (e), the number of non-zero SNP effects (f) and GxE heritability (g).

Figure 2 compares the absolute correlation between the simulated ES and the ES inferred by LEMMA and GPLEMMA. Models like MEMMA do not provide an estimate of the ES. In general, the estimated ES from GPLEMMA had slightly lower absolute correlation with the true ES than the estimated ES from LEMMA, likely due to the data having been simulated from the LEMMA model, with sparse main and GxE SNP effects, whereas the GPLEMMA model assumes a polygenic or infinitesimal model. In large sample sizes (N = 100k), both methods achieve a correlation of over 0.98 with the simulated ES.

Figure 2: Comparison on baseline simulations.

Absolute correlation between the true ES and the ES inferred by LEMMA and GPLEMMA whilst varying the number of environments, the number of active environments, the number of non-zero SNP effects and GxE heritability. The top row contains simulations using N = 25K samples and M = 100K variants; the bottom row contains simulations using N = 100K samples and M = 300K variants. Results from 15 repeats shown.

Figure 3 compares MEMMA, GPLEMMA and LEMMA in a simulation where the functional form of a heritable environmental variable was misspecified (or more specifically; the phenotype depended on the squared effect of a heritable environment). All methods were first tested without any attempt to control for model misspecification, and second using a preprocessing strategy where each environment was tested independently for squared effects on the phenotype and any squared effects with p-value < 0.01/L were included as covariates. These are referred to as (-SQE) and (+SQE) respectively in the figures. Using the (-SQE) strategy, all methods showed upwards bias in estimates of GxE heritability that increased with the strength of the squared effect on the phenotype (Figure 3b). Model misspecification also caused bias in the ES of both GPLEMMA and LEMMA, however bias in the ES from GPLEMMA appeared to be much worse (Figure 3a). Using the (+SQE) strategy, all GxE heritability estimates were unbiased, consistent with earlier simulation results.

Figure 3: Comparison on simulations with a misspecified heritable environment

Estimated proportion of trait variance explained by GxE effects is shown on the left; absolute correlation between the inferred ES and the true ES is shown on the right. Results shown using LEMMA, MEMMA and GPLEMMA. Phenotypes simulated with a squared effect from a heritable confounder (see ??). Results from 20 repeats shown. Abbreviations: (-SQE), no attempt to control for squared effects; (+SQE), squared effects with p < 0.01 (Bonferroni correction for multiple envs) included as covariates.

Figure 4 displays simulation results on the computational complexity of GPLEMMA. Figures 4a and 4b show that GPLEMMA achieved perfect strong scaling1 on the range of cores tested. This suggests that GPLEMMA has superior scalability to LEMMA, as for LEMMA the speedup due to increased cores began to decay after the number of samples per core dropped below 3000 29.
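Strong scaling is usually summarised by speedup and parallel efficiency relative to a single core. A small sketch follows; the function name `strong_scaling` is hypothetical and the timings in the usage comment are invented for illustration, not measurements from the paper.

```python
def strong_scaling(runtimes):
    """Summarise strong scaling from a dict mapping core count -> runtime.

    The dict must contain an entry for 1 core. Speedup is t_1 / t_T and
    efficiency is t_1 / (T * t_T); perfect strong scaling gives efficiency 1.
    """
    t1 = runtimes[1]
    return {cores: {"speedup": t1 / t, "efficiency": t1 / (cores * t)}
            for cores, t in sorted(runtimes.items())}


# Hypothetical timings (seconds) for a fixed problem size:
# strong_scaling({1: 100.0, 2: 50.0, 4: 25.0})
```

An efficiency that stays near 1 as cores are added means runtime is falling in proportion to 1/T; a drop indicates communication or serial overhead.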

Figure 4: Computational complexity of GPLEMMA in simulation.

Strong scaling of GPLEMMA using OpenMPI to parallelise across cores with (a) N = 25k samples and (b) N = 50k samples. Comparison of the runtime of the Levenberg-Marquardt algorithm (c, d) and runtime of the preprocessing step (e, f). By default each run used four cores, N = 25k samples, L = 30 environments and 10 random starts of the Levenberg-Marquardt algorithm. Results from 15 repeats shown.

Time to compute the preprocessing step and to solve the non-linear least squares problem are shown in Figures 4c to 4f, while the number of environments and sample size were varied. As expected, the preprocessing step appeared to be linear in both the number of environments and sample size. Time to solve the non-linear least squares problem appeared to be quadratic in the number of environments and approximately linear in sample size N. As a single LM iteration should have complexity Embedded Image, this suggests that the number of iterations required for convergence of the LM algorithm was independent of sample size and the number of environments (at least for the range of values tested).

Finally, to give a direct comparison between LEMMA and GPLEMMA, we ran each method on simulated data with N = 100k samples, M = 100k SNPs and L = 30 environmental variables using 4 cores for each run. Over 20 repeats, LEMMA took an average of 648 minutes to run whereas GPLEMMA took an average of 233 minutes.

Analysis of UK Biobank data

To compare GPLEMMA and LEMMA on real data we ran both methods on body mass index (log BMI), systolic blood pressure (SBP), diastolic blood pressure (DBP) and pulse pressure (PP) measured on individuals from the UK Biobank. We filtered the SNP genotype data based on minor allele frequency (≥ 0.01) and IMPUTE info score (≥ 0.3), leaving approximately 642,000 variants per trait. We used 42 environmental variables from the UK Biobank, similar to those used in previous GxE analyses of BMI in the UK Biobank 28, 37. After filtering on ancestry and relatedness, and sub-setting down to individuals who had complete data across the phenotype, covariates and environmental factors, we were left with approximately 280,000 samples per trait. The sample, SNP and covariate processing and filtering applied is the same as that reported in the LEMMA paper29.

Table 1 shows the estimates and standard errors for SNP main effects Embedded Image and GxE effects Embedded Image for GPLEMMA and LEMMA applied to the 4 traits. In all cases there is good agreement between the estimates from both methods.

Table 1: Comparison of GPLEMMA and LEMMA on 4 UK Biobank traits.

Heritability estimates obtained using genotyped SNPs.

Discussion

Primarily this paper develops a novel randomized Haseman-Elston non-linear regression approach for modelling GxE interactions in large genetic studies with multiple environmental variables. This approach estimates GxE heritability at the same time as estimating the linear combination of environmental variables (called an ES) that underlies that heritability. This general idea was pioneered in our previous approach LEMMA 29, which used a whole-genome regression approach to learn the ES, and this was then used in a randomized Haseman-Elston approach to estimate GxE heritability. The GPLEMMA approach introduced in this paper does not need that first whole-genome regression step, and this leads to substantial computational savings. The model underlying GPLEMMA is very similar to that in LEMMA, but implicitly assumes a Gaussian distribution for main SNP effects and GxE effects at each SNP.

We compared GPLEMMA to a simpler approach, which we called MEMMA, that estimates GxE heritability of each environmental variable in a joint model, but does not attempt to find the best linear combination of them. We found that estimates of GxE heritability from MEMMA had higher variance than estimates from LEMMA and GPLEMMA, suggesting that the usefulness of MEMMA might be limited. Results from LEMMA and GPLEMMA were very similar, both in terms of estimating the ES and GxE heritability. The primary advantage of GPLEMMA over LEMMA is in computational complexity, as the empirical complexity of GPLEMMA appeared to be linear in sample size whereas LEMMA was shown to be super-linear 29.

In the future it may also be interesting to explore the idea of further partitioning variance using multiple orthogonal linear combinations of environmental variables. This could be expressed using the model Embedded Image where ηj = Ewj is an N-vector, wj is an L-vector and Embedded Image.

LEMMA is also able to perform single SNP hypothesis testing whereas GPLEMMA (currently) cannot. The linear weighting parameter w from GPLEMMA could be used to initialize LEMMA, or the estimated ES could be used as a single environmental variable in LEMMA. Exploring these, and other, approaches is future work.

Author contributions

J.M. and M.K. conceived the ideas for the model and methods development. M.K. conducted all analyses and developed the software that implemented the methods with guidance from J.M. M.K. and J.M. wrote the manuscript.

Acknowledgements

Computation used the Oxford Biomedical Research Computing (BMRC) facility, a joint development between the Wellcome Centre for Human Genetics and the Big Data Institute supported by Health Data Research UK and the NIHR Oxford Biomedical Research Centre. Financial support was provided by the Wellcome Trust Core Award Grant Number 203141/Z/16/Z. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health. We are grateful to Sriram Sankararaman for discussions about the RHE approach during the 2019 CGSI at UCLA.

Footnotes

  • 1 A parallel algorithm has perfect strong scaling if the runtime on T processors is linear in $1/T$, including communication costs.

References

  1. The Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661–678 (2007).
  2. Manolio, T. A. et al. Finding the missing heritability of complex diseases. Nature 461, 747–753 (2009).
  3. Yang, J. et al. Common SNPs explain a large proportion of the heritability for human height. Nature Genetics 42, 565–569 (2010).
  4. Listgarten, J. et al. Improved linear mixed models for genome-wide association studies. Nature Methods 9, 525–526 (2012).
  5. Yang, J., Zaitlen, N. A., Goddard, M. E., Visscher, P. M. & Price, A. L. Advantages and pitfalls in the application of mixed-model association methods (2014).
  6. Loh, P. R. et al. Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nature Genetics 47, 284–290 (2015).
  7. Eskin, E. et al. Efficient Control of Population Structure in Model Organism Association Mapping. Genetics 178, 1709–1723 (2008).
  8. Lippert, C. et al. FaST linear mixed models for genome-wide association studies. Nature Methods 8, 833–835 (2011).
  9. Zhou, X. & Stephens, M. Genome-wide efficient mixed-model analysis for association studies. Nature Genetics 44, 821–824 (2012).
  10. Speed, D., Hemani, G., Johnson, M. R. & Balding, D. J. Improved heritability estimation from genome-wide SNPs. American Journal of Human Genetics (2012).
  11. Evans, L. M. et al. Comparison of methods that use whole genome data to estimate the heritability and genetic architecture of complex traits. Nature Genetics 50, 737–745 (2018).
  12. Speed, D., Cai, N., Johnson, M. R., Nejentsev, S. & Balding, D. J. Reevaluation of SNP heritability in complex human traits. Nature Genetics 49, 986–992 (2017).
  13. Yang, J. et al. Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index. Nature Genetics 47, 1114–1120 (2015).
  14. Hayes, B., Goddard, M. et al. Prediction of total genetic value using genome-wide dense marker maps. Genetics 157, 1819–1829 (2001).
  15. de los Campos, G., Hickey, J. M., Pong-Wong, R., Daetwyler, H. D. & Calus, M. P. L. Whole-genome regression and prediction methods applied to plant and animal breeding. Genetics 193, 327–345 (2013).
  16. Logsdon, B. A., Hoffman, G. E. & Mezey, J. G. A variational Bayes algorithm for fast and accurate multiple locus genome-wide association analysis. BMC Bioinformatics 11 (2010).
  17. Carbonetto, P. & Stephens, M. Scalable variational inference for Bayesian variable selection in regression, and its accuracy in genetic association studies. Bayesian Analysis 7, 73–108 (2012).
  18. Zhou, X., Carbonetto, P. & Stephens, M. Polygenic Modeling with Bayesian Sparse Linear Mixed Models. PLoS Genetics 9, e1003264 (2013).
  19. Speed, D. & Balding, D. J. SumHer better estimates the SNP heritability of complex traits from summary statistics. Nature Genetics 51, 277–284 (2019).
  20. Bulik-Sullivan, B. K. et al. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nature Genetics 47, 291–295 (2015).
  21. Haseman, J. K. & Elston, R. C. The investigation of linkage between a quantitative trait and a marker locus. Behav Genet 2 (1972).
  22. Wu, Y. & Sankararaman, S. A scalable estimator of SNP heritability for biobank-scale data. Bioinformatics 34, i187–i194 (2018).
  23. Pazokitoroudi, A. et al. Scalable multi-component linear mixed models with application to SNP heritability estimation. bioRxiv 522003 (2019).
  24. Young, A. I. et al. Relatedness disequilibrium regression estimates heritability without environmental bias. Nature Genetics 50, 1304–1310 (2018).
  25. Heckerman, D. et al. Linear mixed model for heritability estimation that explicitly addresses environmental variation. Proceedings of the National Academy of Sciences 113, 7377–7382 (2016).
  26. Ober, U. et al. Accounting for Genetic Architecture Improves Sequence Based Genomic Prediction for a Drosophila Fitness Trait. PLOS ONE 10, e0126880 (2015).
  27. Crawford, L., Zeng, P., Mukherjee, S. & Zhou, X. Detecting epistasis with the marginal epistasis test in genetic mapping studies of quantitative traits. PLOS Genetics 13, e1006869 (2017).
  28. Moore, R. et al. A linear mixed model approach to study multivariate gene-environment interactions. Nature Genetics 51, 180–186 (2018).
  29. Kerin, M. & Marchini, J. Gene-environment interactions using a Bayesian whole genome regression model. bioRxiv (2019). URL https://doi.org/10.1101/797829.
  30. Powell, J. E. et al. Signatures of negative selection in the genetic architecture of human complex traits. Nature Genetics 50, 746–753 (2018).
  31. Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nature Genetics 47, 1228–1235 (2015).
  32. Golan, D., Lander, E. S. & Rosset, S. Measuring missing heritability: Inferring the contribution of common variants. Proceedings of the National Academy of Sciences 111, E5272–E5281 (2014).
  33. Yang, J., Zeng, J., Goddard, M. E., Wray, N. R. & Visscher, P. M. Concepts, estimation and interpretation of SNP-based heritability. Nature Genetics 49, 1304–1310 (2017).
  34. Hutchinson, M. F. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics - Simulation and Computation 19, 433–450 (1990).
  35. Zolfaghari, A. et al. An algorithm for the least-squares estimation of nonlinear parameters. International Journal of Soil Science 3, 270–277 (2005).
  36. Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
  37. Young, A. I., Wauthier, F. & Donnelly, P. Multiple novel gene-by-environment interactions modify the effect of FTO variants on body mass index. Nature Communications 7, 12724 (2016).
Posted May 19, 2020.

Subject Area

  • Genetics