Efficient Bayesian mixed model analysis increases association power in large cohorts

Po-Ru Loh; George Tucker; Brendan K Bulik-Sullivan; Bjarni J Vilhjálmsson; Hilary K Finucane; Daniel I Chasman; Paul M Ridker; Benjamin M Neale; Bonnie Berger; Nick Patterson; Alkes L Price

doi:10.1101/007799

Abstract

Linear mixed models are a powerful statistical tool for identifying genetic associations and avoiding confounding. However, existing methods are computationally intractable in large cohorts, and may not optimize power. All existing methods require time cost O(MN²) (where N = #samples and M = #SNPs) and implicitly assume an infinitesimal genetic architecture in which effect sizes are normally distributed, which can limit power. Here, we present a far more efficient mixed model association method, BOLT-LMM, which requires only a small number of O(MN) iterations and increases power by modeling more realistic, non-infinitesimal genetic architectures via a Bayesian mixture prior on marker effect sizes. We applied BOLT-LMM to nine quantitative traits in 23,294 samples from the Women’s Genome Health Study (WGHS) and observed significant increases in power, consistent with simulations. Theory and simulations show that the boost in power increases with cohort size, making BOLT-LMM appealing for GWAS in large cohorts.

Linear mixed models are emerging as the method of choice for association testing in genome-wide association studies (GWAS) because they account for both population stratification and cryptic relatedness and achieve increased statistical power by jointly modeling all genotyped markers [1–12]. However, existing mixed model methods still have limitations. First, mixed model analysis is computationally expensive. Despite a series of recent algorithmic advances, current algorithms require O(MN²) total running time (assuming N < M), where M is the number of markers and N is the sample size, a cost that is becoming prohibitive for large cohorts [12]. Second, current mixed model methods fall short of achieving maximal statistical power owing to suboptimal modeling assumptions regarding the genetic architectures underlying phenotypes. The standard linear mixed model implicitly assumes that all variants are causal with small effect sizes drawn from independent Gaussian distributions (the “infinitesimal model”), whereas in reality, complex traits have been estimated to have on the order of only a few thousand causal loci [13, 14].

Methodologically, efforts to more accurately model non-infinitesimal genetic architectures have followed two general thrusts. One approach is to apply the standard infinitesimal mixed model but adapt the input data. For example, large-effect loci can be explicitly identified and conditioned out as fixed effects [7], or the mixed model can be applied to only a selected subset of markers [9, 11, 15, 16]. A more flexible alternative approach is to adapt the mixed model itself by taking a Bayesian perspective and modeling SNP effects with non-Gaussian prior distributions that better accommodate both small- and large-effect loci. Such methods were pioneered in livestock genetics to improve prediction of genetic values [17] and have been extensively developed in the plant and animal breeding literature for the purpose of genomic selection [18]. While modeling methods that improve prediction should in theory enable corresponding improvements in statistical power of association analysis (via conditioning on other associated loci when testing a candidate marker [9, 12]), a challenge of applying Bayesian methods in the GWAS setting is that Bayesian statistics are not readily interpretable in the customary hypothesis testing framework.

Here, we present an algorithm that performs mixed model analysis in a small number of O(MN)-time iterations and increases power by modeling non-infinitesimal genetic architectures. Our algorithm fits a Gaussian mixture model of SNP effects [19], using a fast variational approximation [20–22] to compute approximate phenotypic residuals, and tests the residuals for association with candidate markers via a retrospective score statistic [23] that provides a bridge between Bayesian modeling for phenotype prediction and the frequentist association testing framework. We calibrate our statistic using an approach based on the recently developed LD Score regression technique [24]. The entire procedure operates directly on raw genotypes stored compactly in memory and does not require computing or storing a genetic relationship matrix. In the special case of the infinitesimal model, we achieve results equivalent to existing methods at dramatically reduced time and memory cost.

We provide an efficient software implementation of our algorithm, BOLT-LMM, and demonstrate its computational efficiency on simulated data sets of up to 480,000 individuals. Our simulations also show that BOLT-LMM achieves increased association power over standard infinitesimal mixed model analysis of traits driven by a few thousand causal SNPs. We applied BOLT-LMM to perform mixed model analysis of nine quantitative traits in 23,294 samples from the Women’s Genome Health Study (WGHS) [25] and observed increased association power equivalent to an up to 10% increase in effective sample size. We demonstrate through theory and simulations that the boost in power increases with cohort size, making BOLT-LMM a promising approach for large-scale GWAS.

Results

Overview of Methods

The BOLT-LMM algorithm consists of four main steps, each of which runs in a small number of O(MN)-time iterations. These steps are: (1a) Estimate variance parameters; (1b) Compute infinitesimal mixed model association statistics (denoted BOLT-LMM-inf); (2a) Estimate Gaussian mixture parameters; (2b) Compute Gaussian mixture model association statistics (BOLT-LMM). Step 1a computes results nearly identical to standard variance components analysis but applies a new stochastic approximation algorithm that reduces time and memory cost by circumventing spectral decomposition, which is expensive for large sample sizes. Instead, the approximation algorithm only requires solving linear systems of mixed model equations, which can be accomplished efficiently using conjugate gradient iteration [26, 27]. Step 1b likewise circumvents spectral decomposition by introducing a new retrospective mixed model association statistic similar to GRAMMAR-Gamma [10] and MASTOR [23], which we compute, up to a calibration constant, using only solutions to linear systems of equations. We estimate the calibration constant by computing and comparing the new statistic and the standard prospective mixed model statistic at a random subset of SNPs, which can likewise be accomplished efficiently using conjugate gradient iteration. This procedure is similar in spirit to GRAMMAR-Gamma calibration but requires only O(MN)-time iterations.
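As an illustration of why conjugate gradient iteration keeps each step at O(MN) cost, the following minimal sketch solves a mixed model system of the form (σ²_g XXᵀ/M + σ²_e I)v = y without ever forming the N × N relationship matrix. It is not the BOLT-LMM implementation: the function name, the parameterization of the covariance, the tolerance, and the use of a dense NumPy genotype matrix are all assumptions made for the example.

```python
import numpy as np

def solve_mixed_model_cg(X, y, var_g, var_e, tol=1e-8, max_iter=500):
    """Solve (var_g * X X^T / M + var_e * I) v = y by conjugate gradient.

    X is an N x M matrix of standardized genotypes (hypothetical input).
    The N x N matrix X X^T / M is never formed: each iteration costs
    two O(MN) matrix-vector products.
    """
    n, m = X.shape

    def apply_V(v):
        # V v = var_g * X (X^T v) / M + var_e * v, evaluated right to left
        return var_g * (X @ (X.T @ v)) / m + var_e * v

    v = np.zeros(n)
    r = y - apply_V(v)        # residual of the linear system
    p = r.copy()              # current search direction
    rs_old = r @ r
    y_norm = np.sqrt(y @ y)
    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        Vp = apply_V(p)
        alpha = rs_old / (p @ Vp)
        v += alpha * p
        r -= alpha * Vp
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * y_norm:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return v, n_iter
```

The design point is the order of multiplication inside `apply_V`: computing X(Xᵀv) touches the genotype matrix twice, whereas forming XXᵀ first would already cost O(MN²).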

Steps 2a and 2b are Gaussian mixture parallels of steps 1a and 1b. BOLT-LMM’s non-infinitesimal model amounts to a relaxation of the standard mixed model, which from a Bayesian perspective imposes a Gaussian prior distribution on SNP effect sizes. BOLT-LMM relaxes this modeling assumption by allowing the prior distribution to be a mixture of two Gaussians, giving the model greater flexibility to accommodate large-effect SNPs while maintaining effective modeling of genome-wide effects (e.g., ancestry). For the Gaussian mixture model, it is no longer tractable to perform exact posterior inference, so BOLT-LMM instead computes a variational approximation [20–22] that converges after a small number of O(MN)-time iterations. Step 2a applies this method within 5-fold cross-validation to estimate best-fit parameters for the prior distribution (taking into account variance parameters estimated in step 1a) based on out-of-sample prediction accuracy. If the prediction accuracy of the best-fit Gaussian mixture model exceeds that of the infinitesimal model by at least a specified amount, step 2b is then run to compute association statistics by testing each SNP against the residual phenotype obtained from the Gaussian mixture model and calibrating the test statistics against the results of step 1b using LD Score regression [24]. Otherwise, the BOLT-LMM association statistic is the same as BOLT-LMM-inf. Both step 1b and step 2b are performed using a leave-one-chromosome-out (LOCO) scheme to avoid proximal contamination [9,12]. Further details of the method are provided in Online Methods and the Supplementary Note. The key properties of BOLT-LMM in terms of speed and model specification are compared to existing mixed model association methods in Table 1.
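To make the effect of a two-Gaussian mixture prior concrete, the sketch below computes the exact posterior mean of a single effect size given one noisy estimate of it. This is a one-SNP illustration only, not the variational algorithm described in Online Methods; the names `p`, `var1`, `var2`, and `s2` are hypothetical parameters chosen for the example.

```python
import numpy as np

def mixture_posterior_mean(b_hat, s2, p, var1, var2):
    """Posterior mean of beta given b_hat ~ N(beta, s2) under the prior
    beta ~ p * N(0, var1) + (1 - p) * N(0, var2)."""
    b_hat = np.asarray(b_hat, dtype=float)

    def log_marginal(var):
        # log density of b_hat under N(0, var + s2)
        tot = var + s2
        return -0.5 * (np.log(2 * np.pi * tot) + b_hat**2 / tot)

    l1 = np.log(p) + log_marginal(var1)
    l2 = np.log1p(-p) + log_marginal(var2)
    # posterior probability that beta was drawn from component 1
    w1 = 1.0 / (1.0 + np.exp(l2 - l1))
    # each component shrinks b_hat by its own Gaussian factor
    shrink1 = var1 / (var1 + s2)
    shrink2 = var2 / (var2 + s2)
    return (w1 * shrink1 + (1 - w1) * shrink2) * b_hat
```

When `var1` equals `var2` this reduces to the single shrinkage factor of a Gaussian prior. When one component is much wider than the other, small estimates are shrunk heavily and large estimates are left nearly intact, which is the flexibility the mixture is meant to provide.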

Table 1. Comparison of fast mixed model association methods that model all SNPs.

BOLT-LMM is much more computationally efficient than existing methods

To analyze the computational performance of BOLT-LMM, we simulated data sets of sizes ranging from N = 3,750 to 480,000 individuals and M = 300,000 SNPs. We used genotypes from the WTCCC2 data set [28] analyzed in ref. [12], which contains 15,633 individuals of European ancestry, to form mosaic chromosomes, and we used a phenotype model in which 5,000 SNPs explained 20% of phenotypic variance (Online Methods).

We benchmarked BOLT-LMM against existing mixed model association methods, running each method for up to 10 days on machines with 96GB of memory. BOLT-LMM completed all analyses through N = 480,000 individuals within these constraints, whereas previous methods could only analyze a maximum of N = 7,500–30,000 individuals (Fig. 1). All previous methods require O(MN²) running time, whereas BOLT-LMM requires only ≈O(MN^1.5) time (Fig. 1a and Supplementary Fig. 1a). We also observed substantial savings in memory use with BOLT-LMM, which requires little more than the MN/4 bytes of memory needed to store raw genotypes, much less than existing mixed model methods (Fig. 1b and Supplementary Fig. 1b).
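The MN/4 figure corresponds to two bits per genotype, four genotypes per byte. The sketch below shows one way such packing could work and evaluates the formula for the largest benchmark. The specific bit layout and the use of code 3 for missing genotypes are assumptions for illustration, not a description of the BOLT-LMM storage format.

```python
import numpy as np

def pack_genotypes(g):
    """Pack genotype codes (0, 1, 2, or 3 for missing) into 2 bits each,
    four genotypes per byte. The code assignment is an assumption."""
    g = np.asarray(g, dtype=np.uint8)
    pad = (-len(g)) % 4                      # pad to a multiple of four
    g = np.concatenate([g, np.zeros(pad, dtype=np.uint8)]).reshape(-1, 4)
    return g[:, 0] | (g[:, 1] << 2) | (g[:, 2] << 4) | (g[:, 3] << 6)

def unpack_genotypes(packed, n):
    """Inverse of pack_genotypes; returns the first n genotype codes."""
    packed = np.asarray(packed, dtype=np.uint8)
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    return ((packed[:, None] >> shifts) & 3).reshape(-1)[:n]

def genotype_bytes(n_samples, n_snps):
    """Bytes needed at 2 bits per genotype, i.e. M * N / 4."""
    return n_samples * n_snps // 4
```

For N = 480,000 and M = 300,000, `genotype_bytes` gives 3.6 × 10^10 bytes (36 GB), which fits within the 96GB machines used for benchmarking.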

Figure 1. Computational performance of mixed model association methods.

Log-log plots of (a) run time and (b) memory as a function of sample size (N). Slopes of the curves correspond to exponents of power-law scaling with N. Benchmarking was performed on simulated data sets in which each sample was generated as a mosaic of genotype data from 2 random “parents” from the WTCCC2 data set (N = 15,633, M = 360 K) and phenotypes were simulated with M_causal = 5000 SNPs explaining h²_causal = 0.2 of phenotypic variance. Reported run times are medians of five identical runs using one core of a 2.27 GHz Intel Xeon L5640 processor. We caution that running time comparisons may vary by a small constant factor as a function of computing environment. FaST-LMM-Select (resp. GCTA-LOCO, EMMAX) memory usage exceeded the 96GB available at N = 15 K (resp. 30 K, 60 K). GEMMA encountered a runtime error (segmentation fault) at N = 30 K. Software versions: FaST-LMM-Select, v2.07; GCTA-LOCO, v1.24; EMMAX, v20120210; GEMMA, v0.94. Numerical data are provided in Supplementary Table 1.

The running time of BOLT-LMM depends not only on the cost of matrix arithmetic, which scales linearly with M and N, but also the number of O(MN)-time iterations required for convergence, which is roughly O(N^0.5) and also varies with heritability, relatedness, and population structure (see Supplementary Note, Supplementary Fig. 1 and Supplementary Fig. 2). These observations apply both to the full Gaussian mixture modeling performed by BOLT-LMM and to the subset of the computation (steps 1a and 1b) needed to compute BOLT-LMM-inf infinitesimal mixed model association statistics, which in our benchmarks required about 40% of the full BOLT-LMM run time (Fig. 1a and Supplementary Fig. 1a). Our results show that even on very large data sets, BOLT-LMM is efficient enough to enable mixed model analysis using a Gaussian mixture prior, which we recommend because of its potential to increase power.
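Since the iteration count grows roughly as N^0.5 and each iteration costs O(MN), total time scales as roughly MN^1.5, and the exponent can be read off as the slope of a log-log plot. The helper below sketches that calculation; the function names are hypothetical, and any timings passed to it in the usage note are synthetic rather than the measurements in Supplementary Table 1.

```python
import numpy as np

def scaling_exponent(sample_sizes, run_times):
    """Least-squares slope of log(run time) against log(N)."""
    slope, _intercept = np.polyfit(np.log(sample_sizes), np.log(run_times), 1)
    return slope

def cost_ratio(n_old, n_new, exponent):
    """Factor by which run time grows when N goes from n_old to n_new,
    assuming time proportional to N ** exponent at fixed M."""
    return (n_new / n_old) ** exponent
```

Under these assumptions, going from N = 15,000 to N = 480,000 (a 32-fold increase) multiplies the cost by `cost_ratio(15000, 480000, 1.5)`, about 181, compared with 1,024 for an O(MN²) method.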

Simulations: BOLT-LMM increases power while controlling false positives

To assess the power of BOLT-LMM to detect associated loci, we performed additional simulations using real genotypes from the WTCCC2 data set, which is an ancestry-stratified sample containing both Northern and Southern European samples. We simulated phenotypes with 1250–10,000 causal SNPs [13,14] explaining 50% of phenotypic variance and an additional 60 candidate causal SNPs explaining 2% of the variance. We further introduced environmental differences in ancestry by including a component of phenotype aligned with the top principal component that explained an additional 1% of the variance. (We note that principal component analysis is not part of BOLT-LMM; our recommendation, consistent with the recommendation of ref. [12], is that it is not necessary to perform PCA when running mixed model association methods.) We chose causal SNPs randomly from the first halves of chromosomes, leaving the second halves of chromosomes to contain only non-causal SNPs (Online Methods).

We computed χ² association statistics using linear regression with 10 principal components (PCA) [29], GCTA-LOCO [12], BOLT-LMM-inf, and BOLT-LMM. We were unable to run FaST-LMM-Select [15] because of its memory requirements (Fig. 1). For each method, we computed means of its χ² statistics over candidate causal SNPs and compared these means across simulation setups involving different numbers of causal SNPs (Fig. 2a). We observed that BOLT-LMM achieved power gains by modeling non-infinitesimal architectures. For the sparsest genetic architecture (1250 causal SNPs plus 60 causal candidate SNPs), we observed a 25% increase in mean BOLT-LMM χ² statistics at candidate SNPs compared to GCTA-LOCO and BOLT-LMM-inf infinitesimal mixed model χ² statistics. This metric is readily interpretable as corresponding to a 25% increase in effective sample size; for completeness, we also computed traditional power curves at two significance thresholds (Supplementary Fig. 4). The power gain of the Gaussian mixture model decreased with increasing numbers of causal SNPs (Fig. 2a). This behavior is expected because the advantage of the Gaussian mixture lies in its ability to more accurately model a small fraction of SNPs with larger effects amid a majority of SNPs with near-zero effects. Larger numbers of causal SNPs explaining a fixed proportion of variance result in smaller effect sizes per causal SNP, giving BOLT-LMM less opportunity for power gain. In contrast, all methods other than BOLT-LMM had performance independent of the number of causal SNPs, consistent with the fact that none of these methods model non-infinitesimal genetic architectures. GCTA-LOCO and BOLT-LMM-inf mean χ² statistics at candidate causal SNPs were essentially identical and slightly exceeded PCA, consistent with theory [12]. We also tested EMMAX [3] and GEMMA [6], which are vulnerable to proximal contamination [9, 12]; these methods suffered loss of power relative to PCA (Supplementary Fig. 3a), consistent with theory [12].
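The power metric used above is a ratio of mean χ² statistics at the same set of candidate SNPs. A minimal sketch of that calculation follows; the function name and the input arrays are hypothetical. Reading the ratio as a gain in effective sample size relies on the approximation that the expected χ² at a truly associated SNP grows in proportion to sample size when the statistic is well above 1.

```python
import numpy as np

def mean_chisq_gain(chisq_new, chisq_ref):
    """Relative increase in mean chi-square of one method over another,
    evaluated at the same set of candidate SNPs."""
    return np.mean(chisq_new) / np.mean(chisq_ref) - 1.0
```

For example, statistics of 12.5 and 25.0 under one method against 10.0 and 20.0 under a reference method give a gain of 0.25.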

Figure 2. BOLT-LMM increases power to detect associations in simulations.

Mean χ² at causal candidate SNPs as a function of (a) number of causal SNPs, (b) proportion of variance explained by causal SNPs, (c) number of samples. Simulations used real genotypes from the WTCCC2 data set (N = 15,633, M = 360 K) and simulated phenotypes with the specified number of causal SNPs explaining the specified proportion of phenotypic variance and 60 more candidate SNPs explaining an additional 2% of the variance. Error bars, s.e.m., 100 simulations. We verified on the first 5 simulations that the BOLT-LMM-inf and GCTA-LOCO statistics are nearly identical (Supplementary Table 6). Numerical data are provided in Supplementary Table 2.

To further explore the relationship between the magnitude of Gaussian mixture model power gain and other parameters of the data set, we also varied the proportion of variance explained by causal SNPs (Fig. 2b), and the number of samples in our simulations (Fig. 2c). We observed that the boost in power of BOLT-LMM over infinitesimal mixed model analysis (GCTA-LOCO, BOLT-LMM-inf) increased with each of these parameters. In further simulations using data sets of size N = 30,000 and N = 60,000 (Online Methods) and simulated phenotypes with M_causal = 250–15,000 causal SNPs explaining 15–35% of the variance, we observed that the effectiveness of the Gaussian mixture model is closely tied to N·h²_g/M_causal (where h²_g is heritability explained by genotyped SNPs); intuitively, this quantity measures the effective number of samples per causal SNP (Supplementary Fig. 5). These results are consistent with theory, which explains that in the absence of confounding, both infinitesimal and Gaussian mixture model analysis provide a power gain over marginal regression by conditioning on the estimated effects of other SNPs when testing a candidate SNP [9, 12]. As sample size increases, the power gain of both methods approaches an asymptote corresponding to an increase in effective sample size of 1/(1 − h²_g), but for sparse genetic architectures, the Gaussian mixture model can approach this asymptote much faster.

To verify that BOLT-LMM is correctly calibrated and robust to confounding, we also computed mean χ² statistics across SNPs on the second halves of chromosomes, simulated to all have zero effect (“null SNPs”). Because our simulated phenotypes included an ancestry effect, linear regression without correcting for population stratification suffered 35% inflation. In contrast, the BOLT-LMM and BOLT-LMM-inf statistics were both well-calibrated (Supplementary Fig. 3b, Supplementary Table 3, and Supplementary Table 4). We further verified that Type I error was properly controlled (Online Methods and Supplementary Table 5) and that the distribution of statistics at null SNPs did not deviate noticeably from a chi-square with 1 degree of freedom (Supplementary Fig. 6a,b). Genomic inflation factors [30] for BOLT-LMM and BOLT-LMM-inf exceeded 1 in these simulations (Supplementary Fig. 6c,d), consistent with polygenicity of the simulated phenotype and use of a mixed model statistic that successfully avoids proximal contamination [12, 13]. In contrast, EMMAX and GEMMA had deflated test statistics (Supplementary Fig. 3b).
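The two calibration summaries used here, mean χ² at null SNPs and the genomic inflation factor, can be computed as in the sketch below. The function names are hypothetical. The genomic inflation factor is taken in its usual form as the median observed statistic divided by the median of a chi-square distribution with 1 degree of freedom (approximately 0.4549).

```python
import numpy as np

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def genomic_inflation_factor(chisq):
    """Median observed statistic divided by the null median."""
    return np.median(chisq) / CHI2_1DF_MEDIAN

def mean_chisq(chisq):
    """Mean statistic; its expectation is 1 for null chi-square(1) statistics."""
    return float(np.mean(chisq))
```

Both summaries are close to 1 for well-calibrated statistics at null SNPs. Computed genome-wide for a polygenic trait, the inflation factor can exceed 1 without confounding, as noted in the text.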

Finally, we investigated the similarity between the BOLT-LMM-inf infinitesimal mixed model statistic and existing methods at the level of individual SNPs. Despite its use of an infinitesimal model, the BOLT-LMM-inf statistic is not identical to any existing mixed model statistic because it is an approximate retrospective test statistic that avoids proximal contamination (Online Methods and Table 1). Nonetheless, we observed that BOLT-LMM-inf statistics very nearly match GCTA-LOCO statistics (which use the standard prospective model), with R² > 0.999 (Supplementary Table 6 and Supplementary Fig. 7).

BOLT-LMM increases power to detect associations for WGHS phenotypes

To assess the efficacy of Gaussian mixture model analysis for increasing power on real phenotypes, we analyzed nine phenotypes in the Women’s Genome Health Study (N = 23,294 samples, M = 324,488 SNPs after QC) (Online Methods). These phenotypes consisted of five lipid phenotypes, height, body mass index, and two blood pressure phenotypes; we chose to analyze these phenotypes because of the availability of published lists of associations from large-scale GWAS of these traits.

We compared the power of three association tests: linear regression with 10 principal components (PCA) [29], infinitesimal mixed model analysis with BOLT-LMM-inf, and Gaussian mixture modeling with BOLT-LMM. Because of memory constraints (Fig. 1), we were unable to run either GCTA-LOCO [12], FaST-LMM [5], or FaST-LMM-Select [15], which are the only previous methods that avoid proximal contamination (Table 1); however, GCTA-LOCO and BOLT-LMM-inf statistics are essentially identical (Supplementary Table 6 and Supplementary Fig. 7). To compare power among these methods, we computed two roughly equivalent metrics: mean χ² statistics at known associated loci, a direct but somewhat noisy approach due to having only 19–180 loci for each trait (Supplementary Table 7), and out-of-sample prediction R² (measured in cross-validation) using all SNPs for the mixed model methods and using only PCs for linear regression. For mixed model analysis, the latter approach estimates the ability of the mixed model to condition on effects of other SNPs when testing a candidate SNP, which drives its power (Online Methods) [12, 31].
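The second metric, out-of-sample prediction R² from cross-validation, can be sketched as follows. Ridge regression is used here as a stand-in for the infinitesimal model, since a Gaussian prior on all SNP effects yields ridge-type estimates. This is not the BOLT-LMM fitting procedure, and the function names, the penalty `lam`, and the fold assignment are assumptions for the example.

```python
import numpy as np

def ridge_predict(X_train, y_train, X_test, lam):
    """Fit ridge regression (Gaussian prior on all SNP effects) and predict."""
    n, m = X_train.shape
    # Dual form: beta = X^T (X X^T + lam I)^{-1} y, an N x N solve
    alpha = np.linalg.solve(X_train @ X_train.T + lam * np.eye(n), y_train)
    return X_test @ (X_train.T @ alpha)

def cv_prediction_r2(X, y, lam, n_folds=5, seed=0):
    """Out-of-sample prediction R^2 from k-fold cross-validation."""
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    pred = np.empty(n)
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False          # hold this fold out
        y_mean = y[train].mean()
        pred[test_idx] = y_mean + ridge_predict(
            X[train], y[train] - y_mean, X[test_idx], lam)
    resid = y - pred
    centered = y - y.mean()
    return 1.0 - (resid @ resid) / (centered @ centered)
```

Every sample is predicted exactly once from a model that never saw it, so the resulting R² measures how much phenotypic variance the fitted SNP effects can account for in new data.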

BOLT-LMM achieved higher power than PCA for all traits studied (Fig. 3 and Supplementary Table 8). Most of the increase was due to gains over infinitesimal mixed model analysis, with the magnitude of this power gain increasing with inferred concentration of genetic effects at few loci (Supplementary Table 9). The standard errors of the direct method of assessing improvement (mean χ² at known loci) were somewhat high (0.6–2.2%; Fig. 3a and Supplementary Table 8), so the improvement was statistically significant (p < 0.05) for only 6 of the 9 traits. On the other hand, all of the improvements were statistically significant (p < 0.0002) according to the prediction R² metric (Fig. 3b and Supplementary Table 8). The largest gains were achieved for lipid traits; for ApoB, a lipoprotein closely related to LDL cholesterol, BOLT-LMM analysis achieved a 10% increase in mean χ² statistics versus PCA and a 9% increase versus infinitesimal mixed model analysis at known loci. To verify that these increases were not merely driven by a few loci with the largest effects, we also computed flat averages across loci of improvements in χ² statistics (restricting to loci replicating in WGHS with at least nominal p < 0.05 significance to reduce statistical noise), obtaining consistent results (Supplementary Table 7). As noted above, simulations show that these improvements will increase with sample size (Fig. 2c and Supplementary Fig. 5).

Figure 3. BOLT-LMM increases power to detect associations for WGHS phenotypes.

We compare power (measured using two roughly equivalent metrics) of linear regression using 10 principal components, standard (infinitesimal) mixed model analysis, and BOLT-LMM Gaussian mixture model analysis. (a) Percent increases in χ² statistics across known loci using mixed model methods vs. PCA: ratios of sums of χ² statistics over typed SNPs in highest LD with published associated SNPs. (b) Prediction R² values from 5-fold cross-validation: each fold was left out in turn and predictions were computed by fitting all SNP effects simultaneously (for mixed model methods) or estimating covariate effects (for PCA) using the training folds. The correspondence between association power and prediction accuracy is such that the red bars in (a) roughly correspond to differences between red and black bars in (b), and analogously for blue bars (Online Methods). Error bars, jackknife s.e. Numerical data are provided in Supplementary Table 8.

We also observed that infinitesimal mixed model analysis achieved statistically significant power gains over PCA, with the magnitude of the power gains increasing with the heritability parameter estimated by BOLT-LMM (Fig. 3 and Supplementary Table 8), which we refer to as pseudo-heritability (see Online Methods), following ref. [3]. For height in WGHS, the moderately large sample size (N = 23,294) was enough to obtain a 6% increase in BOLT-LMM-inf χ² statistics versus PCA, consistent with theory [12, 31]. Once again, larger sample sizes will enable further gains [12, 31].

To verify that BOLT-LMM successfully corrected for confounding from population structure in WGHS, we computed mean χ² statistics across all typed SNPs and genomic inflation factors for the three methods compared above as well as uncorrected marginal linear regression. We observed that PCA, BOLT-LMM-inf, and BOLT-LMM statistics were consistently calibrated, while uncorrected linear regression statistics were inflated, especially for height (Supplementary Table 10). We further verified that genetic variation at the lactase gene had a false-positive genome-wide significant association with height using uncorrected marginal regression [32], which disappeared when using PCA, BOLT-LMM-inf, and BOLT-LMM (Supplementary Table 11).

Discussion

We have described a new algorithm for fast Bayesian mixed model association, BOLT-LMM, and demonstrated that it has time complexity ≈O(MN^1.5) and requires only ≈MN/4 bytes of memory, resulting in orders-of-magnitude improvements in computational efficiency over existing methods for large data sets. We have further shown in simulations and analyses of WGHS phenotypes that the Gaussian mixture modeling capability of BOLT-LMM enables increased association power over standard mixed model analysis while controlling false positives. Among WGHS lipid traits, we observed power increases equivalent to increases in effective sample size of up to 10% over PCA and 9% over standard mixed model analysis.

BOLT-LMM is an advance for two main reasons. First, as sample sizes continue to increase, mixed model analysis is simultaneously becoming more important (in order to correct for population structure and cryptic relatedness in very large data sets) but less practical with existing methods, all of which have ≥ O(MN²) time complexity and high memory requirements. The algorithmic innovations of BOLT-LMM overcome this computational barrier. Second, the ability of BOLT-LMM to better model non-infinitesimal genetic architectures enables a power gain relative to standard mixed model analysis. Recent methodological progress in this direction includes the multi-locus mixed model (MLMM) [7], which identifies and conditions out large-effect loci as fixed effects, and FaST-LMM-Select and related methods [9, 11, 15, 16, 33], which adopt a sparse regression framework that restricts the mixed model to a subset of markers. However, these methods all face the same O(MN²) computational hurdle as standard mixed model analysis.

Bayesian methods have previously been developed that apply non-infinitesimal models to improve the accuracy of genetic risk prediction, but translating Bayes factors and posterior inclusion probabilities into the frequentist hypothesis testing framework favored by the GWAS community is a challenge [34]. The variational Bayes spike regression (vBsr) method [35] is a recent step toward addressing this issue, proposing a z-statistic heuristically calibrated by assuming that the vast majority of variants are unassociated (as in genomic control [30]), but such a technique is prone to deflation when large sample sizes cause inflation due to polygenicity [13, 24]. BOLT-LMM sidesteps this difficulty via its hybrid approach of leaving each chromosome out in turn, fitting a Bayesian model on the remaining SNPs, and then applying a retrospective hypothesis test for association of left-out SNPs with the residual phenotype. In contrast to modeling all SNPs simultaneously and assessing evidence for association using Bayesian posterior inference [34], our approach generalizes existing mixed model methods that are widely used, and we believe its ability to harness the power of Bayesian analysis while still computing frequentist statistics will be useful to GWAS practitioners. Additionally, such a hybrid approach lends itself readily to efficiently testing millions of imputed SNP dosages for association while including only typed SNPs in the mixed model, which we recommend in order to limit computational costs.
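The hybrid leave-one-chromosome-out idea described above can be sketched in a few lines. In this illustration ridge regression stands in for the Bayesian model, and a simple N·r² correlation test stands in for the calibrated retrospective statistic, so the output is uncalibrated and is not the BOLT-LMM statistic. The function name, the penalty `lam`, and the assumption that genotype columns are mean-centered are all choices made for the example.

```python
import numpy as np

def loco_chisq(X, y, chrom, lam):
    """Leave-one-chromosome-out association sketch.

    For each chromosome, fit ridge regression (standing in for the Bayesian
    model) on SNPs from all other chromosomes, then test each left-out SNP
    against the residual phenotype. Columns of X are assumed mean-centered.
    """
    n, m = X.shape
    y = y - y.mean()
    chisq = np.empty(m)
    for c in np.unique(chrom):
        left_out = chrom == c
        X_rest = X[:, ~left_out]
        # Ridge fit in dual form using only the remaining chromosomes
        alpha = np.linalg.solve(X_rest @ X_rest.T + lam * np.eye(n), y)
        resid = y - X_rest @ (X_rest.T @ alpha)
        # N times squared correlation of each left-out SNP with the residual
        Xc = X[:, left_out]
        num = (Xc.T @ resid) ** 2
        den = (Xc ** 2).sum(axis=0) * (resid @ resid)
        chisq[left_out] = n * num / den
    return chisq
```

Because the tested SNP and its neighbors on the same chromosome are excluded from the fitted model, the model cannot absorb the signal being tested, which is the point of avoiding proximal contamination.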

While BOLT-LMM improves upon existing mixed model association methods in both speed and power, BOLT-LMM still has limitations. First, the power gain that BOLT-LMM offers over existing methods via its more flexible prior on SNP effect sizes is contingent on the true genetic architecture being sufficiently non-infinitesimal and the sample size being sufficiently large (Supplementary Fig. 5). Second, BOLT-LMM, like existing mixed model methods, is susceptible to loss of power when used to analyze large ascertained case-control data sets in diseases of low prevalence [12]. We recommend BOLT-LMM for randomly ascertained quantitative traits, ascertained case-control studies of diseases with prevalence ≥5% (e.g., type 2 diabetes, heart disease, common cancers, hypertension, asthma) (see Supplementary Table 12), and studies of rarer diseases in large, non-ascertained population cohorts [36, 37]. For large ascertained case-control studies of rarer diseases, we are developing a method of modeling ascertainment using posterior mean liabilities [38]; applying the techniques of BOLT-LMM to these posterior mean liabilities is an avenue for future research. Third, while mixed model analysis is effective in correcting for many forms of confounding, performing careful data quality control remains critical to avoiding false positives. Fourth, our work does not estimate the heritability explained by genotyped SNPs (ref. [39]), because the pseudo-heritability parameter estimated by BOLT-LMM may be different from it (see Online Methods), and does not conduct or evaluate genetic prediction in external validation samples from an independent cohort [31]. Fifth, we have not studied the performance of mixed model methods in data sets dominated by family structure [23]; this will be investigated elsewhere. Sixth, while BOLT-LMM extends the infinitesimal model by generalizing the SNP effect prior to a mixture of two Gaussians, other priors are possible and may be more appropriate for some genetic architectures (Table 1 of ref. [19]). Seventh, the running time of BOLT-LMM scales with the number of phenotypes analyzed; for data sets with a very large number of phenotypes (P), the GRAMMAR-Gamma method [10], which has running time O(MN² + MNP) (reviewed in ref. [12]), may be faster. Finally, we have developed fast mixed model analysis for a mixed model with one random genetic effect; extending the algorithm to model multiple variance components [40] is a direction for future work.

URLs. BOLT-LMM software, http://www.hsph.harvard.edu/alkes-price/software/.

Acknowledgments

We are grateful to M. Lipson, S. Simmons, A. Gusev, K. Galinsky, J. Yang, P. Visscher, and Z. Zhu for helpful discussions. This research was funded by NIH grant R01 HG006399 and NIH fellowship F32 HG007805. The WGHS is supported by HL043851 and HL080467 from the National Heart, Lung, and Blood Institute and CA047988 from the National Cancer Institute, the Donald W. Reynolds Foundation and the Fondation Leducq, with collaborative scientific support and funding for genotyping provided by Amgen.

References

1. Yu, J. et al. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nature Genetics 38, 203–208 (2006).
2. Kang, H. M. et al. Efficient control of population structure in model organism association mapping. Genetics 178, 1709–1723 (2008).
3. Kang, H. M. et al. Variance component model to account for sample structure in genomewide association studies. Nature Genetics 42, 348–354 (2010).
4. Zhang, Z. et al. Mixed linear model approach adapted for genome-wide association studies. Nature Genetics 42, 355–360 (2010).
5. Lippert, C. et al. FaST linear mixed models for genome-wide association studies. Nature Methods 8, 833–835 (2011).
6. Zhou, X. & Stephens, M. Genome-wide efficient mixed-model analysis for association studies. Nature Genetics 44, 821–824 (2012).
7. Segura, V. et al. An efficient multi-locus mixed-model approach for genome-wide association studies in structured populations. Nature Genetics 44, 825–830 (2012).
8. Korte, A. et al. A mixed-model approach for genome-wide association studies of correlated traits in structured populations. Nature Genetics 44, 1066–1071 (2012).
9. Listgarten, J. et al. Improved linear mixed models for genome-wide association studies. Nature Methods 9, 525–526 (2012).
10. Svishcheva, G. R., Axenovich, T. I., Belonogova, N. M., van Duijn, C. M. & Aulchenko, Y. S. Rapid variance components-based method for whole-genome association analysis. Nature Genetics (2012).
11. Listgarten, J., Lippert, C. & Heckerman, D. FaST-LMM-Select for addressing confounding from spatial structure and rare variants. Nature Genetics 45, 470–471 (2013).
12. Yang, J., Zaitlen, N. A., Goddard, M. E., Visscher, P. M. & Price, A. L. Advantages and pitfalls in the application of mixed-model association methods. Nature Genetics 46, 100–106 (2014).
13. Yang, J. et al. Genomic inflation factors under polygenic inheritance. European Journal of Human Genetics 19, 807–812 (2011).
14. Stahl, E. A. et al. Bayesian inference analyses of the polygenic architecture of rheumatoid arthritis. Nature Genetics 44, 483–489 (2012).
15. Lippert, C. et al. The benefits of selecting phenotype-specific variants for applications of mixed models in genomics. Scientific Reports 3 (2013).
16. Rakitsch, B., Lippert, C., Stegle, O. & Borgwardt, K. A Lasso multi-marker mixed model for association mapping with population structure correction. Bioinformatics 29, 206–214 (2013).
17. Meuwissen, T., Hayes, B. & Goddard, M. Prediction of total genetic value using genomewide dense marker maps. Genetics 157, 1819–1829 (2001).
18. de los Campos, G., Hickey, J. M., Pong-Wong, R., Daetwyler, H. D. & Calus, M. P. Whole-genome regression and prediction methods applied to plant and animal breeding. Genetics 193, 327–345 (2013).
19. Zhou, X., Carbonetto, P. & Stephens, M. Polygenic modeling with Bayesian sparse linear mixed models. PLoS Genetics 9, e1003264 (2013).
20. Meuwissen, T., Solberg, T. R., Shepherd, R. & Woolliams, J. A. A fast algorithm for BayesB type of prediction of genome-wide estimates of genetic value. Genet Sel Evol 41 (2009).
21. Carbonetto, P. & Stephens, M. Scalable variational inference for Bayesian variable selection in regression, and its accuracy in genetic association studies. Bayesian Analysis 7, 73–108 (2012).
22. Logsdon, B. A., Hoffman, G. E. & Mezey, J. G. A variational Bayes algorithm for fast and accurate multiple locus genome-wide association analysis. BMC Bioinformatics 11, 58 (2010).
23. Jakobsdottir, J. & McPeek, M. S. MASTOR: mixed-model association mapping of quantitative traits in samples with related individuals. American Journal of Human Genetics 92, 652–666 (2013).
24. Bulik-Sullivan, B. et al. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. bioRxiv (2014).
25. Ridker, P. M. et al. Rationale, design, and methodology of the Women’s Genome Health Study: a genome-wide association study of more than 25,000 initially healthy American women. Clinical Chemistry 54, 249–255 (2008).
26. Legarra, A. & Misztal, I. Computing strategies in genome-wide selection. Journal of Dairy Science 91, 360–366 (2008).
27. VanRaden, P. Efficient methods to compute genomic predictions. Journal of Dairy Science 91, 4414–4423 (2008).
28. Sawcer, S. et al. Genetic risk and a primary role for cell-mediated immune mechanisms in multiple sclerosis. Nature 476, 214 (2011).
29. Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38, 904–909 (2006).
30. Devlin, B. & Roeder, K. Genomic control for association studies. Biometrics 55, 997–1004 (1999).
31. Wray, N. R. et al. Pitfalls of predicting complex traits from SNPs. Nature Reviews Genetics 14, 507–515 (2013).
32. Campbell, C. D. et al. Demonstrating stratification in a European American population. Nature Genetics 37, 868–872 (2005).
33. Tucker, G., Price, A. L. & Berger, B. A. Improving the power of GWAS and avoiding confounding from population stratification with PC-Select. Genetics (2014).
34. Stephens, M. & Balding, D. J. Bayesian statistical methods for genetic association studies. Nature Reviews Genetics 10, 681–690 (2009).
35. Logsdon, B. A., Carty, C. L., Reiner, A. P., Dai, J. Y. & Kooperberg, C. A novel variational Bayes multiple locus Z-statistic for genome-wide association studies with Bayesian model averaging. Bioinformatics 28, 1738–1744 (2012).
36. Styrkarsdottir, U. et al. Nonsense mutation in the LGR4 gene is associated with several human diseases and other traits. Nature (2013).
37. Do, C. B. et al. Web-based genome-wide association study identifies two novel loci and a substantial genetic component for Parkinson’s disease. PLoS Genetics 7, e1002141 (2011).
38. Hayeck, T. et al. Mixed model with correction for case-control ascertainment increases power in multiple sclerosis association study. Abstract to be presented at the 64th Annual Meeting of The American Society of Human Genetics, October 18–22, 2014, San Diego, CA.
39. Yang, J. et al. Common SNPs explain a large proportion of the heritability for human height. Nature Genetics 42, 565–569 (2010).
40. Speed, D. & Balding, D. J. MultiBLUP: improved SNP-based prediction for complex traits. Genome Research gr–169375 (2014).
41. Chen, W.-M. & Abecasis, G. R. Family-based association tests for genomewide association scans. American Journal of Human Genetics 81, 913–926 (2007).
42. Aulchenko, Y. S., De Koning, D.-J. & Haley, C. Genomewide rapid association using mixed model and regression: a fast and simple method for genomewide pedigree-based quantitative trait loci association analysis. Genetics 177, 577–585 (2007).
43. Chen, W.-M., Manichaikul, A. & Rich, S. S. A generalized family-based association test for dichotomous traits. American Journal of Human Genetics 85, 364–376 (2009).
44. McCulloch, C., Searle, S. & Neuhaus, J. Generalized, linear, and mixed models (Wiley, 2008), 2nd edn.
45. Patterson, H. D. & Thompson, R. Recovery of inter-block information when block sizes are unequal. Biometrika 58, 545–554 (1971).
46. Boyd, S. P. & Vandenberghe, L. Convex Optimization (Cambridge University Press, 2004).
47. McVean, G. A. et al. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2012).
48. Yang, J. et al. Genome partitioning of genetic variation for complex traits using common SNPs. Nature Genetics 43, 519–525 (2011).
49. Bishop, C. M. et al. Pattern recognition and machine learning, vol. 1 (Springer New York, 2006).
50. Price, A. L., Zaitlen, N. A., Reich, D. & Patterson, N. New approaches to population stratification in genome-wide association studies. Nature Reviews Genetics 11, 459–463 (2010).
51. Sul, J. H. & Eskin, E. Mixed models can correct for population structure for genomic regions under selection. Nature Reviews Genetics 14, 300–300 (2013).
52. Price, A. L., Zaitlen, N. A., Reich, D. & Patterson, N. Response to Sul and Eskin. Nature Reviews Genetics 14, 300–300 (2013).
53. Willer, C. J. et al. Discovery and refinement of loci associated with lipid levels. Nature Genetics (2013).
54. Lango Allen, H. et al. Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature 467, 832–838 (2010).
55. Speliotes, E. K. et al. Association analyses of 249,796 individuals reveal 18 new loci associated with body mass index. Nature Genetics 42, 937–948 (2010).
56. Ehret, G. B. et al. Genetic variants in novel pathways influence blood pressure and cardiovascular disease risk. Nature 478, 103–109 (2011).

Posted August 09, 2014.

Subject Area

Genetics

[1] 1.↵
Yu, J. et al. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nature Genetics 38, 203–208 (2006).
OpenUrl CrossRef PubMed Web of Science

[2] 2.
Kang, H. M. et al. Efficient control of population structure in model organism association mapping. Genetics 178, 1709–1723 (2008).
OpenUrl Abstract/FREE Full Text

[3] 3.↵
Kang, H. M. et al. Variance component model to account for sample structure in genomewide association studies. Nature Genetics 42, 348–354 (2010).
OpenUrl CrossRef PubMed Web of Science

[4] 4.
Zhang, Z. et al. Mixed linear model approach adapted for genome-wide association studies. Nature Genetics 42, 355–360 (2010).
OpenUrl CrossRef PubMed Web of Science

[5] 5.↵
Lippert, C. et al. FaST linear mixed models for genome-wide association studies. Nature Methods 8, 833–835 (2011).
OpenUrl

[6] 6.↵
Zhou, X. & Stephens, M. Genome-wide efficient mixed-model analysis for association studies. Nature Genetics 44, 821–824 (2012).
OpenUrl CrossRef PubMed

[7] 7.↵
Segura, V. et al. An efficient multi-locus mixed-model approach for genome-wide association studies in structured populations. Nature Genetics 44, 825–830 (2012).
OpenUrl CrossRef PubMed

[8] 8.
Korte, A. et al. A mixed-model approach for genome-wide association studies of correlated traits in structured populations. Nature Genetics 44, 1066–1071 (2012).
OpenUrl CrossRef PubMed

[9] 9.↵
Listgarten, J. et al. Improved linear mixed models for genome-wide association studies. Nature Methods 9, 525–526 (2012).
OpenUrl

[10] 10.↵
Svishcheva, G. R., Axenovich, T. I., Belonogova, N. M., van Duijn, C. M. & Aulchenko, Y. S. Rapid variance components-based method for whole-genome association analysis. Nature Genetics (2012).

[11] 11.↵
Listgarten, J., Lippert, C. & Heckerman, D. FaST-LMM-Select for addressing confounding from spatial structure and rare variants. Nature Genetics 45, 470–471 (2013).
OpenUrl CrossRef PubMed

[12] 12.↵
Yang, J., Zaitlen, N. A., Goddard, M. E., Visscher, P. M. & Price, A. L. Advantages and pitfalls in the application of mixed-model association methods. Nature Genetics 46, 100– 106 (2014).
OpenUrl CrossRef PubMed

[13] 13.↵
Yang, J. et al. Genomic inflation factors under polygenic inheritance. European Journal of Human Genetics 19, 807–812 (2011).
OpenUrl CrossRef PubMed

[14] 14.↵
Stahl, E. A. et al. Bayesian inference analyses of the polygenic architecture of rheumatoid arthritis. Nature Genetics 44, 483–489 (2012).
OpenUrl CrossRef PubMed

[15] 15.↵
Lippert, C. et al. The benefits of selecting phenotype-specific variants for applications of mixed models in genomics. Scientific Reports 3 (2013).

[16] 16.↵
Rakitsch, B., Lippert, C., Stegle, O. & Borgwardt, K. A Lasso multi-marker mixed model for association mapping with population structure correction. Bioinformatics 29, 206–214 (2013).
OpenUrl CrossRef PubMed

[17] 17.↵
Meuwissen, T., Hayes, B. & Goddard, M. Prediction of total genetic value using genomewide dense marker maps. Genetics 157, 1819–1829 (2001).
OpenUrl Abstract/FREE Full Text

[18] 18.↵
de los Campos, G., Hickey, J. M., Pong-Wong, R., Daetwyler, H. D. & Calus, M. P. Wholegenome regression and prediction methods applied to plant and animal breeding. Genetics 193, 327–345 (2013).
OpenUrl Abstract/FREE Full Text

[19] 19.↵
Zhou, X., Carbonetto, P. & Stephens, M. Polygenic modeling with Bayesian sparse linear mixed models. PLoS Genetics 9, e1003264 (2013).
OpenUrl CrossRef PubMed

[20] 20.↵
Meuwissen, T., Solberg, T. R., Shepherd, R. & Woolliams, J. A. A fast algorithm for BayesB type of prediction of genome-wide estimates of genetic value. Genet Sel Evol 41 (2009).

[21] 21.↵
Carbonetto, P. & Stephens, M. Scalable variational inference for Bayesian variable selection in regression, and its accuracy in genetic association studies. Bayesian Analysis 7, 73–108 (2012).
OpenUrl

[22] 22.↵
Logsdon, B. A., Hoffman, G. E. & Mezey, J. G. A variational Bayes algorithm for fast and accurate multiple locus genome-wide association analysis. BMC Bioinformatics 11, 58 (2010).

[23] 23.↵
Jakobsdottir, J. & McPeek, M. S. MASTOR: mixed-model association mapping of quantitative traits in samples with related individuals. American Journal of Human Genetics 92, 652–666 (2013).
OpenUrl

[24] 24.↵
Bulik-Sullivan, B. et al. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. bioRxiv (2014).

[25] 25.↵
Ridker, P. M. et al. Rationale, design, and methodology of the Women’s Genome Health Study: a genome-wide association study of more than 25,000 initially healthy American women. Clinical Chemistry 54, 249–255 (2008).
OpenUrl Abstract/FREE Full Text

[26] 26.↵
Legarra, A. & Misztal, I. Computing strategies in genome-wide selection. Journal of Dairy Science 91, 360–366 (2008).
OpenUrl CrossRef PubMed Web of Science

[27] 27.↵
VanRaden, P. Efficient methods to compute genomic predictions. Journal of Dairy Science 91, 4414–4423 (2008).
OpenUrl CrossRef PubMed Web of Science

[28] 28.↵
Sawcer, S. et al. Genetic risk and a primary role for cell-mediated immune mechanisms in multiple sclerosis. Nature 476, 214 (2011).

[29] 29.↵
Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38, 904–909 (2006).
OpenUrl CrossRef PubMed Web of Science

[30] 30.↵
Devlin, B. & Roeder, K. Genomic control for association studies. Biometrics 55, 997–1004 (1999).
OpenUrl CrossRef PubMed Web of Science

[31] 31.↵
Wray, N. R. et al. Pitfalls of predicting complex traits from SNPs. Nature Reviews Genetics 14, 507–515 (2013).
OpenUrl CrossRef PubMed

[32] 32.↵
Campbell, C. D. et al. Demonstrating stratification in a European American population. Nature Genetics 37, 868–872 (2005).
OpenUrl CrossRef PubMed Web of Science

[33] 33.↵
Tucker, G., Price, A. L. & Berger, B. A. Improving the power of GWAS and avoiding confounding from population stratification with PC-Select. Genetics (2014).

[34] 34.↵
Stephens, M. & Balding, D. J. Bayesian statistical methods for genetic association studies. Nature Reviews Genetics 10, 681–690 (2009).
OpenUrl CrossRef PubMed Web of Science

[35] 35.↵
Logsdon, B. A., Carty, C. L., Reiner, A. P., Dai, J. Y. & Kooperberg, C. A novel variational Bayes multiple locus Z-statistic for genome-wide association studies with Bayesian model averaging. Bioinformatics 28, 1738–1744 (2012).
OpenUrl CrossRef PubMed

[36] 36.↵
Styrkarsdottir, U. et al. Nonsense mutation in the LGR4 gene is associated with several human diseases and other traits. Nature (2013).

[37] 37.↵
Do, C. B. et al. Web-based genome-wide association study identifies two novel loci and a substantial genetic component for Parkinson’s disease. PLoS Genetics 7, e1002141 (2011).
OpenUrl

[38] 38.↵
Hayeck, T. et al. Mixed model with correction for case-control ascertainment increases power in multiple sclerosis association study. Abstract to be presented at the 64th Annual Meeting of The American Society of Human Genetics, October 18–22, 2014, San Diego, CA.

[39] 39.↵
Yang, J. et al. Common SNPs explain a large proportion of the heritability for human height. Nature Genetics 42, 565–569 (2010).
OpenUrl CrossRef PubMed Web of Science

[40] 40.↵
Speed, D. & Balding, D. J. MultiBLUP: improved SNP-based prediction for complex traits. Genome Research gr–169375 (2014).

[41] 41.↵
Chen, W.-M. & Abecasis, G. R. Family-based association tests for genomewide association scans. American Journal of Human Genetics 81, 913–926 (2007).
OpenUrl CrossRef PubMed Web of Science

[42] 42.↵
Aulchenko, Y. S., De Koning, D.-J. & Haley, C. Genomewide rapid association using mixed model and regression: a fast and simple method for genomewide pedigree-based quantitative trait loci association analysis. Genetics 177, 577–585 (2007).
OpenUrl Abstract/FREE Full Text

[43] 43.↵
Chen, W.-M., Manichaikul, A. & Rich, S. S. A generalized family-based association test for dichotomous traits. American Journal of Human Genetics 85, 364–376 (2009).
OpenUrl CrossRef PubMed Web of Science

[44] 44.↵
McCulloch, C., Searle, S. & Neuhaus, J. Generalized, linear, and mixed models (Wiley, 2008), 2nd edn.

[45] 45.↵
Patterson, H. D. & Thompson, R. Recovery of inter-block information when block sizes are unequal. Biometrika 58, 545–554 (1971).
OpenUrl CrossRef Web of Science

[46] 46.↵
Boyd, S. P. & Vandenberghe, L. Convex Optimization (Cambridge University Press, 2004).

[47] 47.↵
McVean, G. A. et al. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2012).
OpenUrl CrossRef PubMed Web of Science

[48] 48.↵
Yang, J. et al. Genome partitioning of genetic variation for complex traits using common SNPs. Nature Genetics 43, 519–525 (2011).
OpenUrl CrossRef PubMed

[49] 49.↵
Bishop, C. M. et al. Pattern recognition and machine learning, vol. 1 (springer New York, 2006).

[50] 50.↵
Price, A. L., Zaitlen, N. A., Reich, D. & Patterson, N. New approaches to population stratification in genome-wide association studies. Nature Reviews Genetics 11, 459–463 (2010).
OpenUrl CrossRef PubMed Web of Science

[51] 51.
Sul, J. H. & Eskin, E. Mixed models can correct for population structure for genomic regions under selection. Nature Reviews Genetics 14, 300–300 (2013).
OpenUrl

[52] 52.↵
Price, A. L., Zaitlen, N. A., Reich, D. & Patterson, N. Response to sul and eskin. Nature Reviews Genetics 14, 300–300 (2013).
OpenUrl

[53] 53.
Willer, C. J. et al. Discovery and refinement of loci associated with lipid levels. Nature Genetics (2013).

[54] 54.
Lango Allen, H. et al. Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature 467, 832–838 (2010).
OpenUrl CrossRef PubMed Web of Science

[55] 55.
Speliotes, E. K. et al. Association analyses of 249,796 individuals reveal 18 new loci associated with body mass index. Nature Genetics 42, 937–948 (2010).
OpenUrl CrossRef PubMed Web of Science

[56] 56.
Ehret, G. B. et al. Genetic variants in novel pathways influence blood pressure and cardiovascular disease risk. Nature 478, 103–109 (2011).
OpenUrl CrossRef PubMed Web of Science