Abstract
Genetic research utilizes a decomposition of trait variances and covariances into genetic and environmental parts. Our software package biMM is a computationally efficient implementation of a bivariate linear mixed model for settings where hundreds of traits have been measured on partially overlapping sets of individuals.
Availability Implementation in R freely available at www.iki.fi/mpirinen.
1 Introduction
Decomposing phenotypic variance and covariance into genetic and environmental parts is important for designing genetic studies and understanding relationships between traits and diseases. The two main approaches are linear mixed model (LMM) implementations, such as GCTA (Yang et al., 2011), GEMMA (Zhou and Stephens, 2014) or BOLT-REML (Loh et al., 2015), and LD-score regression, implemented in LDSC (Bulik-Sullivan et al., 2015). LMM requires access to the individual-level genotype-phenotype data whereas LDSC only needs output from a genome-wide association study (GWAS) and variant correlations from a reference database, but consequently may be less precise than LMM (Bulik-Sullivan et al., 2015).
We consider settings where individual-level data are available, and hence use LMM. The bivariate LMM for n individuals is , where is 2n-vector of mean-centered phenotype values from which the covariates, such as age, sex and principal components of population structure have been regressed out, is 2n-vector of genetic random effects and is 2n-vector of environmental random effects. The (2n) × (2n) covariance structures are parameterized by six scalars: genetic variances VG1 and VG2, genetic covariance VG12, environmental variances Vε1 and Vε2 and environmental covariance Vε12 as expressed as n × n block matrices. I is the identity matrix and the element i, j of the genetic relationship matrix (GRM) R is where gik is the genotype of individual i at variant k, coded as 0, 1 or 2 copies of the minor allele and is the minor allele frequency (MAF). We use the standard scaling of allelic effects determined by α = −1.
From this model, an estimate of VGt approximates additive genetic variance of each trait (t = 1, 2) explained by the variants included in the calculation of R and is often used as a lower bound for the (narrow-sense) heritability (detailed assumptions in Yang et al. (2015)). An estimate of the genetic correlation measures (average) correlation of the allelic effects of the variants on the two traits. Similarly, we can estimate , the correlation in the environmental components between the traits.
A challenge with bivariate LMMs, that operate on an explicit R matrix (e.g. GCTA and GEMMA), is that they require matrix operations cubic in the cohort size for each pair of traits analyzed, which becomes computationally prohibitive for handling hundreds of traits measured on 10,000s of individuals. Our software package biMM speeds up the bivariate LMM analysis (1) by a fast likelihood computation, (2) by reusing matrix decompositions across pairs of traits, and (3) by arranging data to optimize sample overlap between consecutive pairs of traits.
2 Methods
2.1 Reusing eigendecomposition
Once an eigendecomposition of R is available, our biMM algorithm drops the time complexity from cubic to quadratic for a trait pair and from cubic to linear for a single evaluation of the likelihood function (Supplementary Information). A crucial observation is that a complete sample overlap between two trait pairs means that the same eigendecomposition can be used for both pairs.
2.2 Ordering pairs, imputing and dropping values
We order the trait pairs in such a way that the consecutive pairs have a large sample overlap. biMM further allows imputing at most ti missing values and/or dropping at most td non-missing values for a trait pair to make it match the available eigendecomposition (Supplementary Information). Only when this is not possible for any remaining pair does biMM a new eigendecomposition. Algorithmically, given user-specified ti and td, biMM finds an ordering that results in a small number of total eigendecompositions. This is an instance of the shortest Hamiltonian path problem that we tackle by a greedy heuristic (Supplementary Information).
2.3 Example analysis
We consider data from the Northern Finland Birth Cohort 1966 (NFBC1966) (Rantakallio et al., 1969) with 16 traits having sample sizes between 4736 and 5025 individuals (Supplementary Table S1) and preprocessed by Tukiainen et al. (2014). We analyzed all 120 pairs of traits using both the complete (ti = td = 0) and an approximate versions (ti = td = 200) of biMM and compared with GCTA 1.25.3, GEMMA 0.94.1 and BOLT-REML 2.2 with their default parameters.
3 Results
Figure 1 shows that the complete and approximate versions of biMM are very similar across the 120 pairs of traits. Table 1 shows that the approximate version is much faster than either the complete version or any other software package tested. Detailed results are in Supplementary Figures S1-S4. In short, biMM and GEMMA gave essentially the same results and they were also similar to the results from GCTA and BOLT-REML.
Funding
This work was supported by the Academy of Finland [257654 and 288509 to M.P.; 286607 and 294015 to P.M.; 251217 and 255847 to S.R.]. S.R. was supported by EU FP7 projects ENGAGE (201413) and BioSHaRE (261433), the Finnish Foundation for Cardiovascular Research, Biocentrum Helsinki and the Sigrid Juselius Foundation. NFBC1966 received financial support from University of Oulu Grant no. 65354, Oulu University Hospital Grant no. 2/97, 8/97, Ministry of Health and Social Affairs Grant no. 23/251/97, 160/97, 190/97, National Institute for Health and Welfare, Helsinki Grant no. 54121, Regional Institute of Occupational Health, Oulu, Finland Grant no. 50621, 54231.
Acknowledgements
This study made use of NFBC1966 data. We thank the late professor Paula Rantakallio (launch of NFBC1966), the participants in the 31yrs study and the NFBC project center.
Footnotes
↵* Contact: matti.pirinen{at}helsinki.fi