LiMMBo: a simple, scalable approach for linear mixed models in high-dimensional genetic association studies

Genome-wide association studies have helped to shed light on the genetic architecture of complex traits and diseases. Deep phenotyping of population cohorts is increasingly applied, where multi-to high-dimensional phenotypes are recorded in the individuals. Whilst these rich datasets provide important opportunities to analyse complex trait structures and pleiotropic effects at a genome-wide scale, existing statistical methods for joint genetic analyses are hampered by computational limitations posed by high-dimensional phenotypes. Consequently, such multivariate analyses are currently limited to a moderate number of traits. Here, we introduce a method that combines linear mixed models with bootstrapping (LiMMBo) to enable computationally efficient joint genetic analysis of high-dimensional phenotypes. Our method builds on linear mixed models, thereby providing robust control for population structure and other confounding factors, and the model scales to larger datasets with up to hundreds of phenotypes. We first validate LiMMBo using simulations, demonstrating consistent covariance estimates at greatly reduced computational cost compared to existing methods. We also find LiMMBo yields consistent power advantages compared to univariate modelling strategies, where the advantages of multivariate mapping increases substantially with the phenotype dimensionality. Finally, we applied LiMMBo to 41 yeast growth traits to map their genetic determinants, finding previously known and novel pleiotropic relationships in this high-dimensional phenotype space. LiMMBo is accessible as open source software ( https://github.com/HannahVMeyer/limmbo ).

where the N × P phenotype matrix Y for N individuals and P traits is modelled as the 72 sum of a genetic (or polygenic) component G and a noise component Ψ. We have where R denotes the N × N genetic relationship matrix, I N is the N × N identity 76 matrix and C g and C n are the genetic and the residual trait P × P covariance matrices, 77 respectively. The marginal likelihood of the model in equation Eq. 2 can be expressed in 78 terms of a multivariate normal distribution of the form 79 p (Y|C g , R N , C g ) = N (vecY|0, C g ⊗ R N + C n ⊗ I N ) , where the covariance structure of the phenotypes (in shape of the N × P phenotype  The key innovation in LiMMBo is to perform the variance decomposition on b 98 bootstrap samples of a subset of s traits, and then to appropriately use those bootstrap 99 samples to reconstruct the full C g and C n matrices (Fig. 1, right panel). By breaking 100 down the computationally expensive variance estimation step into b independent steps, 101 LiMMBo reduces the overall complexity of the 'standard REML' to 102 required for the final covariance matrices C g and C n and is achieved by finding the best 104 fit of the bootstrap estimates to a P × P covariance matrix (see Methods). While this 105 approach keeps the complexity at O(P 2 ), it has the additional advantage of allowing for 106 trivial parallelisation of the covariance estimation step. The variance decomposition of 107 each bootstrap is computed independently and our implementation allows for making 108 use of multiple cores. Variance decomposition with REML and LiMMBo. REML approach (left): the phenotype set of P traits and N samples is decomposed into its P × P trait-to-trait covariances C g and C n , based on the provided N × N genetic sample-to-sample kinship estimate matrix R. The noise sample-to-sample matrix I is assumed to be constant (identity matrix). Existing implementations of LMM to fit such variance decomposition (VD) models via REML are limited to moderate numbers of phenotypes. LiMMBo approach (right): for higher trait-set sizes, the phenotypes' variance components are estimated on b s-sized subsets of P which are subsequently combined into the overall P × P covariance matrices C g and C n .

110
To assess the scaling of LiMMBo, we fit the model to simulated datasets with increasing 111 numbers of phenotypes. Fig. 2 shows both the overall compute time and a break down 112 into the two main components of the method, bootstrapping and the combination of the 113 bootstrapping results. The majority of compute time is needed for the variance 114 decomposition of the bootstrapped subsets, which can be trivially parallelised across 115 bootstraps. As a comparison, the time taken by the standard REML approach quickly 116 exceeds the time of LiMMBo and becomes infeasible for more than 30 traits.

117
LiMMBo yields reliable covariance estimates 118 To assess the accuracy of LiMMBo for covariance estimation of the trait covariances C g 119 and C n , we considered in-silico simulations with known ground truth. We simulated 120 datasets with different extents of population structure (  Scalability of REML and LiMMBo. Empirical run times for standard REML inference and LiMMBo on ten simulated datasets with N = 1, 000 individuals, increasing number of traits (P ∈ {10, . . . , 100} and different amount of variance explained by the genetics (h 2 ∈ {0.2, 0.5, 0.8}). Shown is the average empirical run time for 30 experiments per trait size (10 per h 2 ) , with error bars denoting plus or minus one standard deviation across experiments. Lines denote a fit of the theoretical complexity to the observed run times: bootstrapping step (orange): b(N s 4 + s 5 ); the combination of the bootstrapping (blue): P 2 , their combined run time (turquoise): b(N s 4 + s 5 ) + P 2 and the standard REML approach (red): N P 4 + P 5 . b: number of bootstraps, s: bootstrap size, P : phenotype size, N : sample size. The majority of the run time is allocated to the bootstrapping. Run times for the standard REML are depicted up to P = 40 when they already exceed the run times for P = 100 in the LiMMBo approach.
Next, we assessed the utility of LiMMBo estimates of the trait covariance matrices 130 for carrying out multi-trait GWAS. Specifically, we assessed the calibration of type-I 131 error rates using phenotypes simulated from the null (no causal variant) and applying 132 LiMMBo to fit the the trait covariances C g and C n which we then used in a 133 multivariate LMM [17]. For low to moderate number of traits we compared the 134 calibration to a multi-trait LMM using standard REML-derived estimates of C g and 135 C n . Association tests based on the random effect covariance estimates using both 136 inference schemes were calibrated across varying proportion of variance explained by 137 genetic effects and different numbers of traits, including the regime of large P , where 138 existing methods cannot be applied (Fig. 3B).

139
In principle, multivariate genetic analyses in higher dimensions are also possible  Multi-trait genotype to phenotype mapping increases power for 149 high-dimensional phenotypes 150 First, we consider simulated data to assess whether the power benefits of multivariate 151 LMMs translate to high-dimensional phenotypes when using a fixed effect test with as 152 many degrees of freedoms as traits. We examined a wide range of simulation settings, 153 simulating up to 100 traits and varying proportions of traits affected by genetic variants. 154 The mean genetic variance across all traits was kept constant (i.e with an increase in 155 the affected traits the contribution of the genetic component per trait decreases). For 156 each set-up, we simulated 50 different phenotypes and estimated the trait covariance 157 matrices C g and C n via LiMMBo. 158 We used these estimates in a multivariate LMM to test the association between the 159 known causal SNPs (from simulation) and the phenotypes and compared them to 160 results of univariate LMMs of the causal SNPs and the phenotypes.  increasing variance explained by genetics, as the effect sizes of the SNPs (fixed for all 175 simulations) become negligible compared to the overall genetic variance. However, the 176 multivariate model is still able to exploit the correlation of the SNP effects across traits 177 and detects more SNPs in cases of high genetic background structure (Fig. 4C). An

Multi-trait GWAS in yeast 180
Next, to illustrate the utility of LiMMBo, we applied the model to a QTL dataset of 181 growth rates measured in a yeast cross cross for 41 different conditions [2], which is a 182 well-placed complexity of traits given our simulations. Fig. 5A Fig. S5. 196 In addition to the significance of an association, linear mixed models provide effect 197 size estimates for the tested SNPs. By analysing the effect sizes of significantly 198 associated SNPs across traits, we can explore pleiotropic effects of these significant loci. 199 We limited the analysis to the multi-variate effect size estimates from significant 200 variants located within a gene body to ensure we could link the variants to specific 201 genes for downstream analysis. Subsequently, we pruned these variants for LD (r 2 > 0.2, 202 3kb window; see methods) to remove potential bias in the clustering due to an 203 over-representation of variants from large loci.

204
Most of these resulting 210 loci have strong effect size estimates for more than one 205 trait, i.e., most loci seem to be pleiotropic (Fig. 5B, non-zero effect sizes in columns).

209
To gain more insights into the trait architecture, we analysed the effect size estimates 210 across loci and traits using hierarchical clustering (robustness of the clustering assessed 211 via a bootstrapping-based method (pvclust, [27])). Overall, we observed 14 stable trait 212 and 136 stable variant clusters, many of which are nested; these clusters collapsed into 6 213 trait and 19 variant exclusive groupings at the highest branch ( Supplementary Fig. S3). 214 These six significant clusters for traits ( Effect size estimates of LD-pruned (3kb window, r 2 > 0.2), significant SNPs located within a gene were clustered by loci and traits (both hierarchical, average-linkage clustering of Pearson correlation coefficients). Traits and SNPs in stable clusters (pvclust p < 0.05) are marked in blue. Grey boxes indicate stable SNP clusters spread across at least two chromosomes; a and b label two clusters for which suggestive common annotation was found, for details see text.
Finally, for comparison, we repeated the same clustering analysis based on the 232 results from the single-trait model. This identified only 9 stable traits and 117 clusters 233 ( Supplementary Fig. S4 matrices, which usually introduces large biases in the final use of these models. We are 277 actively exploring the use of the LiMMBo covariance estimation in this and other areas. 278 Our ability to generate large cohorts of well phenotyped and genotyped individuals 279 has forced the development of many new methods in statistical genetics. With the 280 advent of genotyped human cohorts up to 500,000 individuals with over 2,000 different 281 traits [37], and plant phenotyping routinely in the 1,000s of individuals from structured 282 crosses with 100s to 1,000s of image based phenotypes [3,13], we need both informative 283 and scalable methods. LiMMBo extends the reach of linear mixed models into this new 284 regime, allowing for new complex genetic associations to be made and a more 285 informative exploration of the underlying biological effects.

287
The LiMMBo implementation and all analyses scripts can be found at  In detail, from the total phenotype set with P traits, b subset of s traits are 300 randomly selected. b depends on the overall trait size P and the sampling size s and is 301 chosen such that each two traits are drawn together at least c times (default: 3).

302
For each subset, the variance decomposition is estimated via 303 mtSet.pycore.modules.multiTraitSetTest and .fitNull i.e. the null model of the mvLMM 304 (Eq. 1). For each bootstrap, the s × s covariance matrices C s g and C s n are recorded.

305
The challenge lies in combining the bootstrap results in a way, that the resulting C g 306 and C n matrices are covariance matrices i.e. positive semi-definite, and serve as good 307 estimators of the true covariance matrices. First, the covariance estimates for each trait 308 pair are averaged over the number of times they were drawn. The covariance estimates 309 of the b subsets are then combined by a least-squares fit to the closest .setParamsCovariance. The model makes use of a Cholesky decomposition of the initial 315 matrices to be fitted via .getParams(), resulting in 1 2 P (P + 1) model parameters to be 316 fitted. C g and C n are fitted separately.

317
Genetic association testing with LiMMBo 318 Genetic association testing with LiMMBo is split into two parts. First, the full trait 319 covariance matrices of the genetic C g and non-genetic C n random effects are estimated 320 via the covariance estimation scheme outlined above. Second, the covariance estimates 321 are used as input parameters for the multivariate linear mixed model with the full 322 phenotype set as the response variable, the genetic variant of interest as fixed effect and 323 possible, further non-genetic covariate effects (Eq. 5).

324
Data simulation 325 Genotypes. Genotypes were simulated similar to strategies described in [19,41]. A 326 cohort of 1,000 synthetic genotypes were generated based on real genotype data from 327 four European ancestry populations of the 1,000 Genomes (1KG) Project (populations: 328 CEU, FIN, GBR, TSI) [24]. Each newly synthesised individual is assigned to a specified 329 number of ancestors a from the original 1KG Project and their genome split into blocks 330 of 1,000 SNPs (thereby retaining realistic LD structure between SNPs). For each SNP 331 block, the ancestor is randomly chosen and its genotype is copied to the individuals 332 genome. Choosing a low number of ancestors for the simulation introduces relatedness 333 among individuals, while high numbers lead to unstructured populations. Two cohorts 334 were simulated, one with a low number of ancestors (a = 2, popStructure) and one with 335 a high number of ancestors (a = 10, noPopStructure). The latter is only used for the    For all analyses in this study, an 'any effect' model allowing for independent effects 386 across all traits was chosen, corresponding to W = I P .

387
Multivariate linear mixed models. The multivariate linear mixed model (mvLMM) is an extension of the mvLM through the addition of a genetic random effect G: where the random effects G and Ψ are described by matrix-normal distribution with 388 column covariance C g and C n and row covariance R and I N , respectively. C g and C n 389 are the P × P trait covariance estimates, R the N × N genetic relationship matrix and 390 I N a N × N identity matrix.

391
Univariate linear mixed models. In the univariate linear mixed model (uvLMM), 392 the genetic background and residual noise are modelled as random effects with a scalar 393 estimate for the genetic G and noise trait-variance Ψ, σ 2 g and σ 2 e . 394 y = FA + xβ T + g + ψ, Covariance comparison 395 The covariance matrices C g and C n were estimated via REML and LiMMBo and the LiMMBo and the standard REML approach were used to estimate the trait covariance 400 matrices C g and C n of the simulated phenotypes without genetic variant effects (null 401 model). C g and C n were then used as input for a multi-trait GWAS against all  [45][46][47]. With a conservative, theoretical significance threshold of p t = 10 −5 , at 449 most one SNP is expected to be false positive in a total of s = 11, 623 SNPs . To find 450 the empirical FDR corresponding to this threshold, k = 50 permutations of the 451 genotypes were generated and the LMMs fitted against these permutation. These 452 p-values were used as the empirical p-value distribution and for p t = 10 −5 , empirical 453 FDRs estimated as FDR mtGWAS = 1.2 × 10 −5 and FDR stGWAS = 8.6 × 10 −6 .

454
Effect size analyses The following effect size analysis was conducted independently 455 for the effect sizes from the single-trait and multi-trait analyses. All SNPs passing the 456 respective FDR threshold (single-trait or multi-trait) were pruned for LD (r 2 > 0.2, 3kb 457 window; as described above) and location within a gene body (yeast genome assembly: 458 ScerevisaeR64-1-1   Table S1. Linear mixed model frameworks for genetic association studies. A list of popular LMM frameworks, grouped by their usage of covariance estimates when fitting the alternative model (first column: E: exact, A: approximate). The complexity describes the complexity for fitting a single LMM as specified in the original publication or summarised elsewhere, as indicated by the footnotes. P indicates the trait size that the model was designed for (according to the original publication). Models with specific parameters are described in more detail in the text (FaST-LMM-select and TASSEL). N : number of samples; s c : number of SNPs used for singular value decomposition; c: compression factor with c = N g for g individuals per group; t, t 1 andt 2 : average number of iterations needed to find parameter estimates. GRAMMAR-Gamma, FaST-LMM-select: t steps of the Brent's algorithm; GEMMA, MTMM: t 1 steps of the EM algorithm, t 2 steps of the NR algorithm; BOLT-LMM: t steps of the variational Bayes and conjugate gradients; TASSEL: t steps of the ProcMixed algorithm in SAS; mtSet: t steps of the L-FBGS.

Framework
Complexity Among the exact methods, FaST-LMM-select reduces the complexity best in terms 639 of sample size by selecting the number of SNPs to use for the estimation of the  Table S2. Parameters for phenotype simulation. For each, the total variance is the sum of both effect variances and has to add to 1. Each component has a certain percentage of its variance that is shared across traits, while the rest is independent.

669
The dataset consists of phenotype and genotype data of 1,008 prototrophic haploid 670 Saccharomyces cerevisiae segregants derived from a cross between a laboratory strain 671 and a wine strain strains [2]. It contains 11,623 unique genotypic markers obtained via 672 short-read sequencing for all 1,008 segregants (no missing genotypes). For phenotyping, 673 segregants were grown on agar plates under 46 different conditions, including different 674 temperatures, pH and nutrient addition (see labels in Supplementary Fig. S8). The 675 phenotypes were defined as end-point colony size normalised relative to growth on 676 control medium. In the following, a trait is defined as the normalised growth size in one 677 condition.

678
Out of the 1,008 segregants, 303 were phenotyped for all 46 traits. Missing  Fig. S7A). 693 We chose the MICE framework [52] with PMM as the imputation method to determine 694 the most suitable imputation parameter settings in the simulated dataset which would 695 then be applied to impute the real missing values in the full dataset. The predictor 696 variables for each trait were determined based on their pair-wise Spearman's rank 697 correlation coefficient ρ with all other traits in the dataset ( Supplementary Fig. S8). In 698 addition, only predictor traits that had been measured in at least 20% of the samples in 699 the dataset were considered. Different predictor variable set-ups were examined based 700 on increasing thresholds for Pearson's correlation coefficient: r 2 ∈ {0, 0.1, 0.2, 0.3}.    [53]. The strength and the direction of significant correlations (p < 0.05) are depicted above. Non-significant correlations are left blank. The traits are clustered based on complete-linkage clustering of (1 − r 2 ) as distance measurement (R Package: corrplot).