Abstract
Linear mixed models are a powerful statistical tool for identifying genetic associations and avoiding confounding. However, existing methods are computationally intractable in large cohorts and may not optimize power. All existing methods require time cost O(MN^2) (where N = number of samples and M = number of SNPs) and implicitly assume an infinitesimal genetic architecture in which effect sizes are normally distributed, which can limit power. Here, we present a far more efficient mixed model association method, BOLT-LMM, which requires only a small number of O(MN)-time iterations and increases power by modeling more realistic, non-infinitesimal genetic architectures via a Bayesian mixture prior on marker effect sizes. We applied BOLT-LMM to nine quantitative traits in 23,294 samples from the Women’s Genome Health Study (WGHS) and observed significant increases in power, consistent with simulations. Theory and simulations show that the boost in power increases with cohort size, making BOLT-LMM appealing for GWAS in large cohorts.
Linear mixed models are emerging as the method of choice for association testing in genome-wide association studies (GWAS) because they account for both population stratification and cryptic relatedness and achieve increased statistical power by jointly modeling all genotyped markers [1–12]. However, existing mixed model methods still have limitations. First, mixed model analysis is computationally expensive. Despite a series of recent algorithmic advances, current algorithms require O(MN^2) total running time (assuming N < M), where M is the number of markers and N is the sample size, a cost that is becoming prohibitive for large cohorts [12]. Second, current mixed model methods fall short of achieving maximal statistical power owing to suboptimal modeling assumptions regarding the genetic architectures underlying phenotypes. The standard linear mixed model implicitly assumes that all variants are causal with small effect sizes drawn from independent Gaussian distributions (the “infinitesimal model”), whereas in reality, complex traits have been estimated to have on the order of only a few thousand causal loci [13, 14].
Methodologically, efforts to more accurately model non-infinitesimal genetic architectures have followed two general thrusts. One approach is to apply the standard infinitesimal mixed model but adapt the input data. For example, large-effect loci can be explicitly identified and conditioned out as fixed effects [7], or the mixed model can be applied to only a selected subset of markers [9, 11, 15, 16]. A more flexible alternative approach is to adapt the mixed model itself by taking a Bayesian perspective and modeling SNP effects with non-Gaussian prior distributions that better accommodate both small- and large-effect loci. Such methods were pioneered in livestock genetics to improve prediction of genetic values [17] and have been extensively developed in the plant and animal breeding literature for the purpose of genomic selection [18]. While modeling methods that improve prediction should in theory enable corresponding improvements in statistical power of association analysis (via conditioning on other associated loci when testing a candidate marker [9, 12]), a challenge of applying Bayesian methods in the GWAS setting is that Bayesian statistics are not readily interpretable in the customary hypothesis testing framework.
Here, we present an algorithm that performs mixed model analysis in a small number of O(MN)-time iterations and increases power by modeling non-infinitesimal genetic architectures. Our algorithm fits a Gaussian mixture model of SNP effects [19], using a fast variational approximation [20–22] to compute approximate phenotypic residuals, and tests the residuals for association with candidate markers via a retrospective score statistic [23] that provides a bridge between Bayesian modeling for phenotype prediction and the frequentist association testing framework. We calibrate our statistic using an approach based on the recently developed LD Score regression technique [24]. The entire procedure operates directly on raw genotypes stored compactly in memory and does not require computing or storing a genetic relationship matrix. In the special case of the infinitesimal model, we achieve results equivalent to existing methods at dramatically reduced time and memory cost.
We provide an efficient software implementation of our algorithm, BOLT-LMM, and demonstrate its computational efficiency on simulated data sets of up to 480,000 individuals. Our simulations also show that BOLT-LMM achieves increased association power over standard infinitesimal mixed model analysis of traits driven by a few thousand causal SNPs. We applied BOLT-LMM to perform mixed model analysis of nine quantitative traits in 23,294 samples from the Women’s Genome Health Study (WGHS) [25] and observed increased association power equivalent to an increase of up to 10% in effective sample size. We demonstrate through theory and simulations that the boost in power increases with cohort size, making BOLT-LMM a promising approach for large-scale GWAS.
Results
Overview of Methods
The BOLT-LMM algorithm consists of four main steps, each of which runs in a small number of O(MN)-time iterations. These steps are: (1a) Estimate variance parameters; (1b) Compute infinitesimal mixed model association statistics (denoted BOLT-LMM-inf); (2a) Estimate Gaussian mixture parameters; (2b) Compute Gaussian mixture model association statistics (BOLT-LMM). Step 1a computes results nearly identical to standard variance components analysis but applies a new stochastic approximation algorithm that reduces time and memory cost by circumventing spectral decomposition, which is expensive for large sample sizes. Instead, the approximation algorithm only requires solving linear systems of mixed model equations, which can be accomplished efficiently using conjugate gradient iteration [26, 27]. Step 1b likewise circumvents spectral decomposition by introducing a new retrospective mixed model association statistic similar to GRAMMAR-Gamma [10] and MASTOR [23], which we compute, up to a calibration constant, using only solutions to linear systems of equations. We estimate the calibration constant by computing and comparing the new statistic and the standard prospective mixed model statistic at a random subset of SNPs, which can likewise be accomplished efficiently using conjugate gradient iteration. This procedure is similar in spirit to GRAMMAR-Gamma calibration but requires only O(MN)-time iterations.
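To illustrate why solving mixed model equations by conjugate gradient needs only O(MN)-time iterations, the following is a minimal Python sketch. It assumes a phenotypic covariance of the form sigma2_g * X X^T / M + sigma2_e * I, with X an N x M matrix of normalized genotypes; this form, the function names, the toy data, and the variance values are all assumptions made for illustration and are not taken from the BOLT-LMM software.

```python
import random

def cov_matvec(X, v, sigma2_g, sigma2_e):
    """Return (sigma2_g * X X^T / M + sigma2_e * I) v without forming X X^T.

    X is an N x M list of rows (hypothetical layout). The cost is O(MN):
    one pass to compute X^T v and one pass to compute X (X^T v).
    """
    n, m = len(X), len(X[0])
    xtv = [sum(X[i][j] * v[i] for i in range(n)) for j in range(m)]
    xxtv = [sum(X[i][j] * xtv[j] for j in range(m)) for i in range(n)]
    return [sigma2_g * xxtv[i] / m + sigma2_e * v[i] for i in range(n)]

def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A given only v -> A v."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x, with x = 0 initially
    p = list(r)          # search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = matvec(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

# Toy data: random "genotypes" and phenotype, for illustration only.
random.seed(0)
N, M = 20, 50
X = [[random.gauss(0.0, 1.0) for _ in range(M)] for _ in range(N)]
y = [random.gauss(0.0, 1.0) for _ in range(N)]

def apply_cov(vec):
    return cov_matvec(X, vec, sigma2_g=0.5, sigma2_e=0.5)

solution = conjugate_gradient(apply_cov, y)
residual = max(abs(a - b) for a, b in zip(apply_cov(solution), y))
```

Each iteration touches the genotype matrix twice and never builds the N x N relationship matrix, which is the property the text relies on.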
Steps 2a and 2b are Gaussian mixture parallels of steps 1a and 1b. BOLT-LMM’s non-infinitesimal model amounts to a relaxation of the standard mixed model, which from a Bayesian perspective imposes a Gaussian prior distribution on SNP effect sizes. BOLT-LMM relaxes this modeling assumption by allowing the prior distribution to be a mixture of two Gaussians, giving the model greater flexibility to accommodate large-effect SNPs while maintaining effective modeling of genome-wide effects (e.g., ancestry). For the Gaussian mixture model, it is no longer tractable to perform exact posterior inference, so BOLT-LMM instead computes a variational approximation [20–22] that converges after a small number of O(MN)-time iterations. Step 2a applies this method within 5-fold cross-validation to estimate best-fit parameters for the prior distribution (taking into account variance parameters estimated in step 1a) based on out-of-sample prediction accuracy. If the prediction accuracy of the best-fit Gaussian mixture model exceeds that of the infinitesimal model by at least a specified amount, step 2b is then run to compute association statistics by testing each SNP against the residual phenotype obtained from the Gaussian mixture model and calibrating the test statistics against the results of step 1b using LD Score regression [24]. Otherwise, the BOLT-LMM association statistic is the same as BOLT-LMM-inf. Both step 1b and step 2b are performed using a leave-one-chromosome-out (LOCO) scheme to avoid proximal contamination [9,12]. Further details of the method are provided in Online Methods and the Supplementary Note. The key properties of BOLT-LMM in terms of speed and model specification are compared to existing mixed model association methods in Table 1.
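The relaxation from one Gaussian to a mixture of two can be written compactly as follows. The symbols (beta_m for the effect of SNP m, mixing weight p, and the variance parameters) are notation assumed here for illustration; the exact parameterization is given in Online Methods and the Supplementary Note.

```latex
% Infinitesimal model: a single Gaussian prior on each SNP effect
\beta_m \sim \mathcal{N}(0, \sigma_\beta^2)

% Non-infinitesimal model: a mixture of two Gaussians
\beta_m \sim p\,\mathcal{N}(0, \sigma_{\beta,1}^2) + (1 - p)\,\mathcal{N}(0, \sigma_{\beta,2}^2)
```

Setting p = 1, or setting the two variances equal, recovers the single Gaussian prior, which is why the mixture model is a strict generalization of the infinitesimal one.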
BOLT-LMM is much more computationally efficient than existing methods
To analyze the computational performance of BOLT-LMM, we simulated data sets of sizes ranging from N = 3,750 to 480,000 individuals, each with M = 300,000 SNPs. We used genotypes from the WTCCC2 data set [28] analyzed in ref. [12], which contains 15,633 individuals of European ancestry, to form mosaic chromosomes, and we used a phenotype model in which 5,000 SNPs explained 20% of phenotypic variance (Online Methods).
We benchmarked BOLT-LMM against existing mixed model association methods, running each method for up to 10 days on machines with 96GB of memory. BOLT-LMM completed all analyses through N = 480,000 individuals within these constraints, whereas previous methods could only analyze a maximum of N = 7,500–30,000 individuals (Fig. 1). All previous methods require O(MN^2) running time, whereas BOLT-LMM requires only ≈O(MN^1.5) time (Fig. 1a and Supplementary Fig. 1a). We also observed substantial savings in memory use with BOLT-LMM, which requires little more than the MN/4 bytes of memory needed to store raw genotypes, much less than existing mixed model methods (Fig. 1b and Supplementary Fig. 1b).
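As a quick check of the MN/4 figure, the short Python sketch below evaluates it at the benchmark sizes. MN/4 bytes corresponds to four genotypes per byte, that is, 2 bits per genotype. The comparison against a dense N x N relationship matrix of 8-byte floating-point values is an assumption added here for contrast, not a figure reported in the benchmarks.

```python
def genotype_bytes(n_samples, n_snps):
    """Bytes needed to store raw genotypes at 4 genotypes per byte (MN/4)."""
    return n_samples * n_snps / 4

def dense_grm_bytes(n_samples, bytes_per_entry=8):
    """Bytes for a dense N x N matrix; 8-byte entries are an assumption."""
    return n_samples * n_samples * bytes_per_entry

M = 300_000
for N in (3_750, 30_000, 480_000):
    geno_gb = genotype_bytes(N, M) / 1e9
    grm_gb = dense_grm_bytes(N) / 1e9
    print(f"N = {N:>7,}: genotypes {geno_gb:8.2f} GB, dense N x N matrix {grm_gb:8.2f} GB")
```

At N = 480,000 and M = 300,000 the raw genotypes come to 36 GB, which fits within the 96GB machines used in the benchmarks.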
The running time of BOLT-LMM depends not only on the cost of matrix arithmetic, which scales linearly with M and N, but also on the number of O(MN)-time iterations required for convergence, which is roughly O(N^0.5) and also varies with heritability, relatedness, and population structure (see Supplementary Note, Supplementary Fig. 1 and Supplementary Fig. 2). These observations apply both to the full Gaussian mixture modeling performed by BOLT-LMM and to the subset of the computation (steps 1a and 1b) needed to compute BOLT-LMM-inf infinitesimal mixed model association statistics, which in our benchmarks required about 40% of the full BOLT-LMM run time (Fig. 1a and Supplementary Fig. 1a). Our results show that even on very large data sets, BOLT-LMM is efficient enough to enable mixed model analysis using a Gaussian mixture prior, which we recommend because of its potential to increase power.
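The overall scaling follows from multiplying the per-iteration cost by the iteration count. The worked expression below is a back-of-the-envelope sketch: the constant c and the symbol T are introduced here for illustration, and the dependence on heritability, relatedness, and population structure is ignored.

```latex
T_{\text{BOLT-LMM}}(M, N) \approx \underbrace{c\,MN}_{\text{cost per iteration}} \times \underbrace{N^{0.5}}_{\text{number of iterations}} = c\,MN^{1.5}

\frac{T_{\text{BOLT-LMM}}(M, 2N)}{T_{\text{BOLT-LMM}}(M, N)} \approx 2^{1.5} \approx 2.83
\qquad \text{versus} \qquad
\frac{M\,(2N)^2}{M\,N^2} = 4 \quad \text{for an } O(MN^2) \text{ method}
```

Under this approximation, each doubling of the cohort costs roughly 2.8 times as much rather than 4 times as much, and the gap compounds across successive doublings.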
Simulations: BOLT-LMM increases power while controlling false positives
To assess the power of BOLT-LMM to detect associated loci, we performed additional simulations using real genotypes from the WTCCC2 data set, which is an ancestry-stratified sample containing both Northern and Southern European samples. We simulated phenotypes with 1,250–10,000 causal SNPs [13, 14] explaining 50% of phenotypic variance and an additional 60 candidate causal SNPs explaining 2% of the variance. We further introduced environmental differences in ancestry by including a component of phenotype aligned with the top principal component that explained an additional 1% of the variance. (We note that principal component analysis is not part of BOLT-LMM; our recommendation, consistent with the recommendation of ref. [12], is that it is not necessary to perform PCA when running mixed model association methods.) We chose causal SNPs randomly from the first halves of chromosomes, leaving the second halves of chromosomes to contain only non-causal SNPs (Online Methods).
We computed χ2 association statistics using linear regression with 10 principal components (PCA) [29], GCTA-LOCO [12], BOLT-LMM-inf, and BOLT-LMM. We were unable to run FaST-LMM-Select [15] because of its memory requirements (Fig. 1). For each method, we computed means of its χ2 statistics over candidate causal SNPs and compared these means across simulation setups involving different numbers of causal SNPs (Fig. 2a). We observed that BOLT-LMM achieved power gains by modeling non-infinitesimal architectures. For the sparsest genetic architecture (1,250 causal SNPs plus 60 candidate causal SNPs), we observed a 25% increase in mean BOLT-LMM χ2 statistics at candidate SNPs compared to GCTA-LOCO and BOLT-LMM-inf infinitesimal mixed model χ2 statistics. This metric is readily interpretable as corresponding to a 25% increase in effective sample size; for completeness, we also computed traditional power curves at two significance thresholds (Supplementary Fig. 4). The power gain of the Gaussian mixture model decreased with increasing numbers of causal SNPs (Fig. 2a). This behavior is expected because the advantage of the Gaussian mixture lies in its ability to more accurately model a small fraction of SNPs with larger effects amid a majority of SNPs with near-zero effects. Larger numbers of causal SNPs explaining a fixed proportion of variance result in smaller effect sizes per causal SNP, giving BOLT-LMM less opportunity for power gain. In contrast, all methods other than BOLT-LMM had performance independent of the number of causal SNPs, consistent with the fact that none of these methods model non-infinitesimal genetic architectures. GCTA-LOCO and BOLT-LMM-inf mean χ2 statistics at candidate causal SNPs were essentially identical and slightly exceeded PCA, consistent with theory [12]. We also tested EMMAX [3] and GEMMA [6], which are vulnerable to proximal contamination [9, 12]; these methods suffered loss of power relative to PCA (Supplementary Fig. 3a), consistent with theory [12].
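The comparison metric used above, the ratio of mean χ2 statistics at candidate causal SNPs, can be sketched in a few lines of Python. The function name and the input values are hypothetical toy numbers chosen for illustration only; they are not results from the simulations. Reading the ratio as a gain in effective sample size assumes that the expected χ2 at a truly associated SNP grows roughly in proportion to sample size.

```python
from statistics import mean

def relative_gain(chi2_new, chi2_baseline):
    """Fractional increase in mean chi-square of one method over another."""
    return mean(chi2_new) / mean(chi2_baseline) - 1.0

# Toy chi-square statistics at four candidate SNPs (illustration only).
baseline_chi2 = [20.0, 32.0, 44.0, 24.0]   # e.g. an infinitesimal model
mixture_chi2 = [25.0, 40.0, 55.0, 30.0]    # e.g. a mixture model

gain = relative_gain(mixture_chi2, baseline_chi2)
print(f"Mean chi-square gain: {gain:.0%}")
```

Averaging before taking the ratio weights loci by the size of their statistics, so a few strong loci can dominate; a per-locus average of ratios is the natural robustness check.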
To further explore the relationship between the magnitude of Gaussian mixture model power gain and other parameters of the data set, we also varied the proportion of variance explained by causal SNPs (Fig. 2b) and the number of samples in our simulations (Fig. 2c). We observed that the boost in power of BOLT-LMM over infinitesimal mixed model analysis (GCTA-LOCO, BOLT-LMM-inf) increased with each of these parameters. In further simulations using data sets of size N = 30,000 and N = 60,000 (Online Methods) and simulated phenotypes with M_causal = 250–15,000 causal SNPs explaining 15–35% of the variance, we observed that the effectiveness of the Gaussian mixture model is closely tied to N h_g^2 / M_causal (where h_g^2 is heritability explained by genotyped SNPs); intuitively, this quantity measures the effective number of samples per causal SNP (Supplementary Fig. 5). These results are consistent with theory, which explains that in the absence of confounding, both infinitesimal and Gaussian mixture model analysis provide a power gain over marginal regression by conditioning on the estimated effects of other SNPs when testing a candidate SNP [9, 12]. As sample size increases, the power gain of both methods approaches an asymptote corresponding to an increase in effective sample size of 1/(1 − h_g^2), but for sparse genetic architectures, the Gaussian mixture model can approach this asymptote much faster.
To verify that BOLT-LMM is correctly calibrated and robust to confounding, we also computed mean χ2 statistics across SNPs on the second halves of chromosomes, simulated to all have zero effect (“null SNPs”). Because our simulated phenotypes included an ancestry effect, linear regression without correcting for population stratification suffered 35% inflation. In contrast, the BOLT-LMM and BOLT-LMM-inf statistics were both well-calibrated (Supplementary Fig. 3b, Supplementary Table 3, and Supplementary Table 4). We further verified that Type I error was properly controlled (Online Methods and Supplementary Table 5) and that the distribution of statistics at null SNPs did not deviate noticeably from a chi-square with 1 degree of freedom (Supplementary Fig. 6a,b). Genomic inflation factors [30] for BOLT-LMM and BOLT-LMM-inf exceeded 1 in these simulations (Supplementary Fig. 6c,d), consistent with polygenicity of the simulated phenotype and use of a mixed model statistic that successfully avoids proximal contamination [12, 13]. In contrast, EMMAX and GEMMA had deflated test statistics (Supplementary Fig. 3b).
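For readers unfamiliar with the two calibration summaries used here, the following Python sketch computes the mean χ2 and the genomic inflation factor. It uses the conventional definition of the inflation factor, the median observed χ2 divided by the median of a chi-square distribution with 1 degree of freedom (about 0.4549); that definition is standard but is not spelled out in the text above. The simulated statistics and the uniform 1.35 scaling are toy stand-ins, not data from the study.

```python
import random
from statistics import mean, median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def genomic_inflation(chi2_stats):
    """Median observed chi-square divided by the null median (1 df)."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN

# Null statistics: squares of standard normal draws (toy data).
random.seed(1)
null_chi2 = [random.gauss(0.0, 1.0) ** 2 for _ in range(100_000)]

# Toy stand-in for confounding: scale every statistic by the same factor.
inflated_chi2 = [1.35 * s for s in null_chi2]

lambda_null = genomic_inflation(null_chi2)
lambda_inflated = genomic_inflation(inflated_chi2)
mean_null = mean(null_chi2)
```

Under the null both summaries sit near 1. As the text notes, values above 1 do not by themselves indicate confounding, because polygenicity also raises test statistics.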
Finally, we investigated the similarity between the BOLT-LMM-inf infinitesimal mixed model statistic and the statistics of existing methods at the level of individual SNPs. Despite its use of an infinitesimal model, the BOLT-LMM-inf statistic is not identical to any existing mixed model statistic because it is an approximate retrospective test statistic that avoids proximal contamination (Online Methods and Table 1). Nonetheless, we observed that BOLT-LMM-inf statistics very nearly match GCTA-LOCO statistics (which use the standard prospective model), with R2 > 0.999 (Supplementary Table 6 and Supplementary Fig. 7).
BOLT-LMM increases power to detect associations for WGHS phenotypes
To assess the efficacy of Gaussian mixture model analysis for increasing power on real phenotypes, we analyzed nine phenotypes in the Women’s Genome Health Study (N = 23,294 samples, M = 324,488 SNPs after QC) (Online Methods). These phenotypes consisted of five lipid phenotypes, height, body mass index, and two blood pressure phenotypes; we chose to analyze these phenotypes because of the availability of published lists of associations from large-scale GWAS of these traits.
We compared the power of three association tests: linear regression with 10 principal components (PCA) [29], infinitesimal mixed model analysis with BOLT-LMM-inf, and Gaussian mixture modeling with BOLT-LMM. Because of memory constraints (Fig. 1), we were unable to run GCTA-LOCO [12], FaST-LMM [5], or FaST-LMM-Select [15], which are the only previous methods that avoid proximal contamination (Table 1); however, GCTA-LOCO and BOLT-LMM-inf statistics are essentially identical (Supplementary Table 6 and Supplementary Fig. 7). To compare power among these methods, we computed two roughly equivalent metrics: mean χ2 statistics at known associated loci, a direct but somewhat noisy approach because only 19–180 loci are known for each trait (Supplementary Table 7), and out-of-sample prediction R2 (measured in cross-validation) using all SNPs for the mixed model methods and using only PCs for linear regression. For mixed model analysis, the latter approach estimates the ability of the mixed model to condition on effects of other SNPs when testing a candidate SNP, which drives its power (Online Methods) [12, 31].
BOLT-LMM achieved higher power than PCA for all traits studied (Fig. 3 and Supplementary Table 8). Most of the increase was due to gains over infinitesimal mixed model analysis, with the magnitude of this power gain increasing with inferred concentration of genetic effects at few loci (Supplementary Table 9). The standard errors of the direct method of assessing improvement (mean χ2 at known loci) were somewhat high (0.6–2.2%; Fig. 3a and Supplementary Table 8), so the improvement was statistically significant (p < 0.05) for only 6 of the 9 traits. On the other hand, all of the improvements were statistically significant (p < 0.0002) according to the prediction R2 metric (Fig. 3b and Supplementary Table 8). The largest gains were achieved for lipid traits; for ApoB, a lipoprotein closely related to LDL cholesterol, BOLT-LMM analysis achieved a 10% increase in mean χ2 statistics versus PCA and a 9% increase versus infinitesimal mixed model analysis at known loci. To verify that these increases were not merely driven by a few loci with the largest effects, we also computed flat averages across loci of improvements in χ2 statistics (restricting to loci replicating in WGHS with at least nominal p < 0.05 significance to reduce statistical noise), obtaining consistent results (Supplementary Table 7). As noted above, simulations show that these improvements will increase with sample size (Fig. 2c and Supplementary Fig. 5).
We also observed that infinitesimal mixed model analysis achieved statistically significant power gains over PCA, with the magnitude of the power gains increasing with the heritability parameter estimated by BOLT-LMM (Fig. 3 and Supplementary Table 8), which we refer to as pseudo-heritability (see Online Methods), following ref. [3]. For height, the moderately large sample size of WGHS (N = 23,294) was enough to obtain a 6% increase in BOLT-LMM-inf χ2 statistics versus PCA, consistent with theory [12, 31]. Once again, larger sample sizes will enable further gains [12, 31].
To verify that BOLT-LMM successfully corrected for confounding from population structure in WGHS, we computed mean χ2 statistics across all typed SNPs and genomic inflation factors for the three methods compared above as well as uncorrected marginal linear regression. We observed that PCA, BOLT-LMM-inf, and BOLT-LMM statistics were consistently calibrated, while uncorrected linear regression statistics were inflated, especially for height (Supplementary Table 10). We further verified that genetic variation at the lactase gene had a false-positive genome-wide significant association with height using uncorrected marginal regression [32], which disappeared when using PCA, BOLT-LMM-inf, and BOLT-LMM (Supplementary Table 11).
Discussion
We have described a new algorithm for fast Bayesian mixed model association, BOLT-LMM, and demonstrated that it has time complexity ≈O(MN^1.5) and requires only ≈MN/4 bytes of memory, resulting in orders-of-magnitude improvements in computational efficiency over existing methods for large data sets. We have further shown in simulations and analyses of WGHS phenotypes that the Gaussian mixture modeling capability of BOLT-LMM enables increased association power over standard mixed model analysis while controlling false positives. Among WGHS lipid traits, we observed power increases equivalent to increases in effective sample size of up to 10% over PCA and 9% over standard mixed model analysis.
BOLT-LMM is an advance for two main reasons. First, as sample sizes continue to increase, mixed model analysis is simultaneously becoming more important (in order to correct for population structure and cryptic relatedness in very large data sets) but less practical with existing methods, all of which have ≥ O(MN^2) time complexity and high memory requirements. The algorithmic innovations of BOLT-LMM overcome this computational barrier. Second, the ability of BOLT-LMM to better model non-infinitesimal genetic architectures enables a power gain relative to standard mixed model analysis. Recent methodological progress in this direction includes the multi-locus mixed model (MLMM) [7], which identifies and conditions out large-effect loci as fixed effects, and FaST-LMM-Select and related methods [9, 11, 15, 16, 33], which adopt a sparse regression framework that restricts the mixed model to a subset of markers. However, these methods all face the same O(MN^2) computational hurdle as standard mixed model analysis.
Bayesian methods have previously been developed that apply non-infinitesimal models to improve the accuracy of genetic risk prediction, but translating Bayes factors and posterior inclusion probabilities into the frequentist hypothesis testing framework favored by the GWAS community is a challenge [34]. The variational Bayes spike regression (vBsr) method [35] is a recent step toward addressing this issue, proposing a z-statistic heuristically calibrated by assuming that the vast majority of variants are unassociated (as in genomic control [30]), but such a technique is prone to deflation when large sample sizes cause inflation due to polygenicity [13, 24]. BOLT-LMM sidesteps this difficulty via its hybrid approach of leaving each chromosome out in turn, fitting a Bayesian model on the remaining SNPs, and then applying a retrospective hypothesis test for association of left-out SNPs with the residual phenotype. In contrast to modeling all SNPs simultaneously and assessing evidence for association using Bayesian posterior inference [34], our approach generalizes existing mixed model methods that are widely used, and we believe its ability to harness the power of Bayesian analysis while still computing frequentist statistics will be useful to GWAS practitioners. Additionally, such a hybrid approach lends itself readily to efficiently testing millions of imputed SNP dosages for association while including only typed SNPs in the mixed model, which we recommend in order to limit computational costs.
While BOLT-LMM improves upon existing mixed model association methods in both speed and power, BOLT-LMM still has limitations. First, the power gain that BOLT-LMM offers over existing methods via its more flexible prior on SNP effect sizes is contingent on the true genetic architecture being sufficiently non-infinitesimal and the sample size being sufficiently large (Supplementary Fig. 5). Second, BOLT-LMM, like existing mixed model methods, is susceptible to loss of power when used to analyze large ascertained case-control data sets in diseases of low prevalence [12]. We recommend BOLT-LMM for randomly ascertained quantitative traits, ascertained case-control studies of diseases with prevalence ≥5% (e.g., type 2 diabetes, heart disease, common cancers, hypertension, asthma) (see Supplementary Table 12), and studies of rarer diseases in large, non-ascertained population cohorts [36, 37]. For large ascertained case-control studies of rarer diseases, we are developing a method of modeling ascertainment using posterior mean liabilities [38]; applying the techniques of BOLT-LMM to these posterior mean liabilities is an avenue for future research. Third, while mixed model analysis is effective in correcting for many forms of confounding, performing careful data quality control remains critical to avoiding false positives. Fourth, our work does not estimate the heritability explained by genotyped SNPs (h_g^2; ref. [39]), because the pseudo-heritability parameter may be different from h_g^2 (see Online Methods), and does not conduct or evaluate genetic prediction in external validation samples from an independent cohort [31]. Fifth, we have not studied the performance of mixed model methods in data sets dominated by family structure [23]; this will be investigated elsewhere. Sixth, while BOLT-LMM extends the infinitesimal model by generalizing the SNP effect prior to a mixture of two Gaussians, other priors are possible and may be more appropriate for some genetic architectures (Table 1 of ref. [19]). Seventh, the running time of BOLT-LMM scales with the number of phenotypes analyzed; for data sets with a very large number of phenotypes (P), the GRAMMAR-Gamma method [10], which has running time O(MN^2 + MNP) (reviewed in ref. [12]), may be faster. Finally, we have developed fast mixed model analysis for a mixed model with one random genetic effect; extending the algorithm to model multiple variance components [40] is a direction for future work.
URLs. BOLT-LMM software, http://www.hsph.harvard.edu/alkes-price/software/.
Acknowledgments
We are grateful to M. Lipson, S. Simmons, A. Gusev, K. Galinsky, J. Yang, P. Visscher, and Z. Zhu for helpful discussions. This research was funded by NIH grant R01 HG006399 and NIH fellowship F32 HG007805. The WGHS is supported by HL043851 and HL080467 from the National Heart, Lung, and Blood Institute and CA047988 from the National Cancer Institute, the Donald W. Reynolds Foundation and the Fondation Leducq, with collaborative scientific support and funding for genotyping provided by Amgen.