Introduction

With the advance of high-throughput genotyping technology, the paradigm for mapping quantitative trait loci (QTLs) has gradually shifted from linkage analysis of sparse genetic markers to genome-wide association studies (GWAS) based on many thousands of single-nucleotide polymorphisms (SNPs). At the same time, association studies tend to involve more than one quantitative trait or complex disease, with the underlying loci located in different chromosomal regions, which allows the investigation of common genetic risk factors underlying multiple traits. Although these traits could be analyzed separately with a univariate genetic model, statistical methods and algorithms have been developed for simultaneously analyzing multiple normal traits (Jiang and Zeng, 1995; Fang et al., 2008; Ayroles et al., 2009; Zhu and Zhang, 2009; Stephens, 2010; Nadeau and Dudley, 2011; Shriner, 2012), multiple discrete traits (Lange and Whittaker, 2001; Xu et al., 2005; Yang et al., 2009) and mixtures of normal and discrete traits (Prentice and Zhao, 1991; Fitzmaurice and Laird, 1997; Liu et al., 2009).

When each quantitative trait is analyzed separately with the same genetic model, least squares or maximum likelihood estimation gives the same genetic effect estimates as the joint analysis of multiple correlated traits. However, the significance test for QTL in the separate analyses does not consider correlations among the traits being analyzed. In contrast, jointly analyzing all correlated traits has two distinct advantages. First, the statistical power to detect QTL and the precision of parameter estimation are increased (Jiang and Zeng, 1995; Zhu and Zhang, 2009). Second, the joint model supports biologically meaningful conclusions: it helps to address the issue of pleiotropy vs close linkage (Almasy et al., 1997; Liu et al., 2007) and to assess the endophenotypes intermediate between a gene and a trait. However, because of the large number of matrix calculations and the increased degrees of freedom of the test statistic (Weller et al., 1996), the multivariate analysis of all traits becomes impractical when the number of quantitative traits is large. More recently, Verzilli et al. (2005) and Banerjee et al. (2008) employed the seemingly unrelated regression model (Zellner, 1962) to map QTLs of correlated traits. With two multivariate models and the associated Bayesian algorithms, their modeling scheme outperforms the conventional multivariate model in terms of QTL identification.

When many correlated normal traits are collected, principal component analysis (PCA) and discriminant analysis are candidate methods for reducing the dimension of these traits. By performing PCA on all phenotypic traits based on their covariance matrix, a collection of independent principal components of the original traits, or ‘super traits’, can be obtained. A few leading principal components that explain most of the variance of the original phenotypes are then chosen for separate mapping analyses (Weller et al., 1996; Mangin et al., 1998; Elston et al., 2000; Korol et al., 2001). With the regular PCA transformation, mapping results lack biological interpretability, as the super traits are linear combinations of the original traits. Although genetic effects of detected QTLs on super traits can always be back-transformed to those for the original traits using the matrix of principal eigenvectors (Weller et al., 1996; Knott and Haley, 2000), this framework cannot produce parameter estimates equivalent to those from the joint analysis of the original correlated traits. Specific to each tested position, discriminant analysis can obtain one best linear combination of the traits from the estimated genetic and residual covariance matrices (Gilbert and Le Roy, 2003, 2004), improving the precision of parameter estimation and the statistical power of QTL detection.

A great volume of transcriptional expressions that are regarded as quantitative traits can be analyzed using expression QTL (eQTL) mapping approaches (Brem et al., 2002; Schadt et al., 2003; Morley et al., 2004; Stranger et al., 2005; Wang et al., 2006). Several methods for eQTL mapping also motivate the modeling scheme of multiple quantitative trait mapping. By first clustering transcripts with similar expression into groups, a sparse partial least-squares regression framework has been proposed to select markers associated with each cluster of genes (Chun and Keles, 2009). The adaptive multi-task least absolute shrinkage and selection operator (LASSO; Zhu et al., 2008) has been developed for detecting eQTLs; it takes related expression traits into account simultaneously while incorporating many regulatory features. On the other hand, the graph-guided fused LASSO (Kim and Xing, 2009; Kim et al., 2009) considers regulatory networks over multiple expression traits within an association analysis, but prior knowledge of genomic locations is not incorporated. To date, however, most eQTL mapping approaches still rely on a limited number of genetic markers from relatively small populations.

This article presents a statistical framework for analyzing many regular quantitative traits from GWAS, where a multivariate genetic model is constructed and each trait’s associations with all SNPs are tested using the same genetic model. An extremely fast LASSO (Yuan and Lin, 2005; Friedman et al., 2010) is employed to fit the sparse oversaturated genetic model for each trait. Instead of working on principal components of the phenotypic traits, this framework applies PCA to the estimated residual covariance matrix, so that multiple regular quantitative traits are transformed to a group of pseudo principal components, or pseudo traits. Under this transformation, the univariate analyses for pseudo traits give parameter estimates equivalent to those from the joint multivariate analysis, but the computational burden of multiple quantitative trait mapping is largely reduced. Statistical and computational efficiencies of the proposed method are validated through extensive simulations and a real data set from a GWAS of 20 slaughtering traits and meat quality traits in beef cattle.

Method

Multivariate genetic model

In a GWAS involving multiple quantitative traits collected from a randomized population, t traits of interest are observed and m SNPs are genotyped on n subjects. By only considering the additive effects of SNPs, the phenotype of each trait can be partitioned into:

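$$
y_{il} = \sum_{j=1}^{p} x_{ij}\beta_{lj} + \sum_{j=1}^{m} z_{ij}\alpha_{lj} + e_{il} \qquad (1)
$$

for i=1,2,···,n, l=1,2,···,t,

where yil is the phenotypic value of the lth trait for the ith subject, βlj is the jth of p systemic environmental effects for the lth trait, xij is the incidence value for the ith subject in the jth systemic environmental effect, αlj is the genetic effect of the jth marker on the lth trait, zij is the indicator variable of the jth marker for the ith subject, defined as 0 for the heterozygote and −1 and 1 for the two homozygotes, and eil is the residual error. The residual vector $\mathbf{e}_i = (e_{i1}, e_{i2}, \ldots, e_{it})^{T}$ follows a multivariate normal distribution $N(\mathbf{0}, \boldsymbol{\Sigma}_e)$, whose lth diagonal element is the residual variance of the lth trait. We denote the simultaneous linear equations consisting of such models for t traits as the multivariate genetic model for mapping QTLs for multiple traits.

With vector notation, model (1) is written as

$$
\mathbf{y}_i = \mathbf{B}\mathbf{x}_i + \mathbf{A}\mathbf{z}_i + \mathbf{e}_i \qquad (2)
$$

with $\mathbf{y}_i = (y_{i1}, \ldots, y_{it})^{T}$, $\mathbf{x}_i = (x_{i1}, \ldots, x_{ip})^{T}$, $\mathbf{z}_i = (z_{i1}, \ldots, z_{im})^{T}$, $\mathbf{B} = (\beta_{lj})_{t \times p}$ and $\mathbf{A} = (\alpha_{lj})_{t \times m}$.

The expectation of yi is

$$
\boldsymbol{\mu}_i = E(\mathbf{y}_i) = \mathbf{B}\mathbf{x}_i + \mathbf{A}\mathbf{z}_i \qquad (3)
$$

and its covariance matrix is V(yi)=Σe.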

Shrinkage estimation for genetic effects

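As phenotypes are correlated with each other but independent among subjects, the likelihood function L is the product of the individual multivariate normal densities, or

$$
L = \prod_{i=1}^{n} (2\pi)^{-t/2}\,|\boldsymbol{\Sigma}_e|^{-1/2} \exp\left\{-\tfrac{1}{2}(\mathbf{y}_i - \boldsymbol{\mu}_i)^{T}\boldsymbol{\Sigma}_e^{-1}(\mathbf{y}_i - \boldsymbol{\mu}_i)\right\} \qquad (4)
$$

Assuming μi is known, the maximum likelihood estimate of the residual covariance matrix is given by

$$
\hat{\boldsymbol{\Sigma}}_e = \frac{1}{n}\sum_{i=1}^{n} (\mathbf{y}_i - \boldsymbol{\mu}_i)(\mathbf{y}_i - \boldsymbol{\mu}_i)^{T} \qquad (5)
$$

In general, $\hat{\boldsymbol{\Sigma}}_e$ is positive definite, and thus can be decomposed into

$$
\hat{\boldsymbol{\Sigma}}_e = \mathbf{V}^{T}\boldsymbol{\Lambda}\mathbf{V} \qquad (6)
$$

according to the eigendecomposition, where V is the orthogonal matrix whose rows are the eigenvectors and Λ is a diagonal matrix consisting of the eigenvalues. Let $\mathbf{y}_i^{*} = \mathbf{V}\mathbf{y}_i$ and $\boldsymbol{\mu}_i^{*} = \mathbf{V}\boldsymbol{\mu}_i$; then, with Σe replaced by its estimate, the likelihood function becomes

$$
L = \prod_{i=1}^{n}\prod_{l=1}^{t} (2\pi\delta_l)^{-1/2} \exp\left\{-\frac{(y_{il}^{*} - \mu_{il}^{*})^{2}}{2\delta_l}\right\} \qquad (7)
$$

where δl is the lth eigenvalue along the diagonal of matrix Λ, $y_{il}^{*}$ is defined as the lth pseudo principal component (or pseudo trait) for the ith subject and $\mu_{il}^{*}$ is the expected value of $y_{il}^{*}$. As Equation (7) shows, with this decomposition L can be partitioned into the product of t likelihood functions for t pseudo traits, so the genetic model can be solved for each pseudo trait in turn, although the pseudo traits may not be independent of each other. Based on this equivalent form of the solution, the procedure can efficiently solve for genetic effects in the presence of multiple traits and a huge number of genetic markers. Moreover, when a fairly large number of traits are of interest, the procedure can focus on the first few leading pseudo traits, which reduces computational cost by working in a lower-dimensional space.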
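
The factorization of the likelihood over pseudo traits can be checked numerically. The following is a minimal sketch, assuming the decomposition is written with the eigenvectors as the rows of V and using NumPy; the covariance matrix, means and sample size are made-up illustrative values, and the true covariance is used in place of its estimate for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
n, t = 50, 3

# Hypothetical residual covariance matrix and expectations (illustration only).
M = rng.normal(size=(t, t))
sigma_e = M @ M.T + t * np.eye(t)            # symmetric positive definite
mu = rng.normal(size=(n, t))
y = mu + rng.multivariate_normal(np.zeros(t), sigma_e, size=n)

# eigh returns eigenvectors as columns of U, so V = U.T has them as rows
# and sigma_e = V.T @ diag(delta) @ V.
delta, U = np.linalg.eigh(sigma_e)
V = U.T

def joint_loglik(y, mu, sigma):
    """Log-likelihood of n independent multivariate normal observations."""
    r = y - mu
    _, logdet = np.linalg.slogdet(sigma)
    quad = np.einsum("ij,jk,ik->i", r, np.linalg.inv(sigma), r)
    k = sigma.shape[0]
    return float(np.sum(-0.5 * (k * np.log(2 * np.pi) + logdet + quad)))

def pseudo_trait_loglik(y, mu, V, delta):
    """Sum of univariate normal log-likelihoods of the pseudo traits."""
    r_star = (y - mu) @ V.T                  # row i holds V (y_i - mu_i)
    terms = -0.5 * (np.log(2 * np.pi * delta) + r_star**2 / delta)
    return float(terms.sum())

ll_joint = joint_loglik(y, mu, sigma_e)
ll_split = pseudo_trait_loglik(y, mu, V, delta)
print(abs(ll_joint - ll_split))              # agreement up to rounding error
```

The two values agree because the quadratic form in the joint density splits into one squared term per pseudo trait, each scaled by its eigenvalue.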

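In particular, we implement penalized likelihood-based shrinkage estimation for each pseudo trait defined in Equation (7). With many thousands of SNPs, the number of unknown parameters in the expectation of each pseudo trait is far greater than the sample size, but the number of non-zero genetic effects is very small. Therefore, the LASSO regression with a coordinate descent step (Yuan and Lin, 2005; Friedman et al., 2010) can efficiently shrink most of the genetic effects to zero when estimating the genetic effects associated with each pseudo trait. Denote the genetic effect of the jth SNP on the lth pseudo trait by $\alpha_{lj}^{*}$ and the jth systemic environmental effect on the lth pseudo trait by $\beta_{lj}^{*}$; then the genetic effects are estimated by

$$
(\hat{\boldsymbol{\beta}}_l^{*}, \hat{\boldsymbol{\alpha}}_l^{*}) = \arg\min_{\boldsymbol{\beta}_l^{*},\,\boldsymbol{\alpha}_l^{*}} \left\{ \frac{1}{2}\sum_{i=1}^{n}\Big(y_{il}^{*} - \sum_{j=1}^{p} x_{ij}\beta_{lj}^{*} - \sum_{j=1}^{m} z_{ij}\alpha_{lj}^{*}\Big)^{2} + \lambda_1 \sum_{j=1}^{m} |\alpha_{lj}^{*}| \right\} \qquad (8)
$$

for l=1,2,···,t′,

where $y_{il}^{*}$ is the value of the lth pseudo trait for the ith subject, p is the number of systemic environmental effects, λ1 is a tuning parameter, which can be optimized with cross validation, and t′ is the total number of pseudo traits considered in the lower-dimensional space.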
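
The coordinate descent idea behind the LASSO step can be illustrated with a short sketch. It is a simplified stand-in rather than the implementation used in this study: it assumes a squared-error loss scaled by 1/2, omits systemic environmental effects and standardization, and uses a fixed, hypothetical penalty instead of cross validation. The function names and the simulated data are illustrative only.

```python
import numpy as np

def soft_threshold(value, lam):
    """Shrink value toward zero by lam; values within [-lam, lam] become 0."""
    return np.sign(value) * max(abs(value) - lam, 0.0)

def lasso_coordinate_descent(Z, y, lam, n_sweeps=200, tol=1e-8):
    """Minimise 0.5 * ||y - Z a||^2 + lam * ||a||_1 by cyclic coordinate descent."""
    n, m = Z.shape
    a = np.zeros(m)
    col_ss = (Z**2).sum(axis=0)              # sum of squares of each column
    resid = y.astype(float).copy()           # current residual y - Z a
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(m):
            if col_ss[j] == 0.0:
                continue                     # monomorphic marker: effect stays 0
            # Inner product of column j with the residual that excludes marker j.
            rho = Z[:, j] @ resid + col_ss[j] * a[j]
            new = soft_threshold(rho, lam) / col_ss[j]
            if new != a[j]:
                resid -= Z[:, j] * (new - a[j])
                max_change = max(max_change, abs(new - a[j]))
                a[j] = new
        if max_change < tol:
            break
    return a

# Hypothetical example: 50 markers, one of which has a true effect.
rng = np.random.default_rng(0)
n, m = 200, 50
Z = rng.choice([-1.0, 0.0, 1.0], size=(n, m), p=[0.25, 0.5, 0.25])
a_true = np.zeros(m)
a_true[3] = 1.5
y = Z @ a_true + rng.standard_normal(n)
lam = 50.0
a_hat = lasso_coordinate_descent(Z, y, lam)
print(np.flatnonzero(a_hat), a_hat[3])
```

Each update solves the one-dimensional penalized problem for a single marker exactly, which is why most effects are set to exactly zero rather than merely made small.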

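So far, the algorithm has been outlined in terms of the estimated residual covariance matrix $\hat{\boldsymbol{\Sigma}}_e$, which was obtained by treating the expectation μi of yi as known; in practice, μi is unknown and must be estimated first. As univariate analysis for model (1) gives point estimates of genetic effects identical to those from multivariate analysis, we solve the mean equation for each original trait separately to obtain an estimate of Σe. Specifically, the LASSO regression (Yuan and Lin, 2005; Friedman et al., 2010) can be used to fit the oversaturated model (1) and to estimate the systemic environmental effects as well as the non-zero genetic effects by solving

$$
\min_{\boldsymbol{\beta}_l,\,\boldsymbol{\alpha}_l} \left\{ \frac{1}{2}\sum_{i=1}^{n}\Big(y_{il} - \sum_{j=1}^{p} x_{ij}\beta_{lj} - \sum_{j=1}^{m} z_{ij}\alpha_{lj}\Big)^{2} + \lambda_2 \sum_{j=1}^{m} |\alpha_{lj}| \right\} \qquad (9)
$$

for l=1,2,···,t, where p is the number of systemic environmental effects and λ2 is a tuning parameter to be determined by cross validation. The fitted model yields the estimated expectation of yi, which in turn yields $\hat{\boldsymbol{\Sigma}}_e$ through Equation (5).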

To identify the genetic risk factors associated with multiple correlated traits, this framework transforms the phenotypic traits to a group of new traits using the eigenvectors of the residual covariance matrix. Approaching the problem in this way breaks the complex problem down into a sequence of separate analyses of individual pseudo traits. More importantly, it ensures the equivalence of parameter estimates between the two analysis frameworks. In sum, the parameter estimation can be implemented in the following steps:

  1) Estimate the expectation for each trait by solving the objective function (9).

  2) Calculate the residual covariance matrix $\hat{\boldsymbol{\Sigma}}_e$ using (5).

  3) Decompose $\hat{\boldsymbol{\Sigma}}_e$ into VTΛV.

  4) Determine the number of pseudo principal components to be considered according to the cumulative proportion contributed by the eigenvalues in matrix Λ.

  5) Generate the pseudo principal components by multiplying the multiple phenotypes by the matrix of corresponding eigenvectors.

  6) Estimate the non-zero genetic effects for each pseudo principal component by solving Equation (8).
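
The six steps can be strung together in a compact sketch. This is an illustration under simplifying assumptions, not the authors' code: the only systemic environmental effect is an intercept (handled by centering), the penalty is a fixed hypothetical value rather than one chosen by cross validation, the 85% threshold is an arbitrary default, and the helper names `lasso_cd` and `residual_pca_pipeline` are invented for this example.

```python
import numpy as np

def lasso_cd(Z, y, lam, n_sweeps=100):
    """Cyclic coordinate descent for 0.5 * ||y - Z a||^2 + lam * ||a||_1."""
    a = np.zeros(Z.shape[1])
    col_ss = (Z**2).sum(axis=0)
    resid = y.astype(float).copy()
    for _ in range(n_sweeps):
        for j in np.flatnonzero(col_ss):
            rho = Z[:, j] @ resid + col_ss[j] * a[j]
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ss[j]
            resid -= Z[:, j] * (new - a[j])
            a[j] = new
    return a

def residual_pca_pipeline(Y, Z, lam, var_explained=0.85):
    n, t = Y.shape
    Yc = Y - Y.mean(axis=0)                  # assumption: intercept-only environment
    # Step 1: fit each original trait separately to estimate its expectation.
    A_hat = np.column_stack([lasso_cd(Z, Yc[:, l], lam) for l in range(t)])
    resid = Yc - Z @ A_hat
    # Step 2: residual covariance matrix.
    sigma_hat = resid.T @ resid / n
    # Step 3: decompose sigma_hat = V.T @ diag(delta) @ V (eigenvectors as rows).
    delta, U = np.linalg.eigh(sigma_hat)
    order = np.argsort(delta)[::-1]          # largest eigenvalue first
    delta, V = delta[order], U[:, order].T
    # Step 4: number of pseudo traits from the cumulative eigenvalue proportion.
    cum = np.cumsum(delta) / delta.sum()
    t_prime = min(int(np.searchsorted(cum, var_explained)) + 1, t)
    # Step 5: pseudo traits y*_i = V y_i, restricted to the leading t_prime rows.
    Y_star = Yc @ V[:t_prime].T
    # Step 6: shrinkage estimation for each pseudo trait.
    A_star = np.column_stack([lasso_cd(Z, Y_star[:, l], lam) for l in range(t_prime)])
    return sigma_hat, delta, V, t_prime, Y_star, A_star

# Hypothetical data: 3 correlated traits, 30 markers, one pleiotropic marker.
rng = np.random.default_rng(1)
n, m, t = 200, 30, 3
Z = rng.choice([-1, 0, 1], size=(n, m), p=[0.25, 0.5, 0.25])
A_true = np.zeros((m, t))
A_true[4] = [1.0, 0.8, 0.0]
E = rng.multivariate_normal(np.zeros(t), 0.5 * np.eye(t) + 0.5, size=n)
Y = Z @ A_true + E
sigma_hat, delta, V, t_prime, Y_star, A_star = residual_pca_pipeline(Y, Z, lam=30.0)
print(t_prime, np.flatnonzero(A_star.any(axis=1)))
```

The sketch stops after shrinkage estimation; significance testing of the selected effects would follow as a separate step.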

Statistical inference for genetic effects

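After the shrinkage estimation of genetic effects for each pseudo trait, the number of non-zero genetic effects is generally less than the sample size. By directly applying ordinary least squares estimation, the systemic environmental effects and the non-zero genetic effects can be unbiasedly estimated for each pseudo trait as follows

$$
\hat{\boldsymbol{\theta}}_l = (\mathbf{W}_l^{T}\mathbf{W}_l)^{-1}\mathbf{W}_l^{T}\mathbf{y}_l^{*} \qquad (10)
$$

for l=1,2,···,t′, where $\mathbf{y}_l^{*}$ is the n×1 vector of the lth pseudo trait, $\mathbf{W}_l$ is the n×(p+q) design matrix combining the incidence values of the p systemic environmental effects with the indicator variables of the q selected SNPs, $\hat{\boldsymbol{\theta}}_l$ collects the corresponding p environmental effects and q genetic effects, and q is the number of selected non-zero genetic effects.

Also, residual variance for each pseudo trait is estimated by

$$
\hat{\sigma}_l^{2} = \frac{(\mathbf{y}_l^{*} - \mathbf{W}_l\hat{\boldsymbol{\theta}}_l)^{T}(\mathbf{y}_l^{*} - \mathbf{W}_l\hat{\boldsymbol{\theta}}_l)}{n - p - q} \qquad (11)
$$

The variance-covariance matrix of the estimated parameters is then calculated by

$$
V(\hat{\boldsymbol{\theta}}_l) = \hat{\sigma}_l^{2}\,(\mathbf{W}_l^{T}\mathbf{W}_l)^{-1} \qquad (12)
$$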

Finally, the significance of non-zero genetic effects can be statistically tested based on Equations (10), (11) and (12), and SNPs corresponding to significant non-zero genetic effects are identified as the QTLs for the pseudo quantitative trait. To interpret the genetic effect on the quantitative traits measured in the original scale, the genetic effect associated with each detected QTL is transformed by

$$
\hat{\boldsymbol{\alpha}}_j = \sum_{l=1}^{t'} \boldsymbol{\nu}_l\,\hat{\alpha}_{lj}^{*} \qquad (13)
$$

where $\hat{\boldsymbol{\alpha}}_j$ is the t×1 vector of estimated effects of the jth detected QTL on the original traits, νl is the eigenvector corresponding to the lth principal component in matrix V, and $\hat{\alpha}_{lj}^{*}$ is the jth estimated genetic effect for the lth pseudo trait.
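
A short derivation may help show why this back-transformation returns effects on the original scale. It assumes that the pseudo traits are obtained as $\mathbf{y}_i^{*} = \mathbf{V}\mathbf{y}_i$ with an orthogonal matrix V whose rows are the eigenvectors $\boldsymbol{\nu}_l^{T}$; the matrices $\mathbf{A}$ and $\mathbf{A}^{*}$ (genetic effects on the original and pseudo traits, one row per trait) are notation introduced here for illustration.

```latex
\begin{aligned}
\mathbf{y}_i^{*} = \mathbf{V}\mathbf{y}_i
  \;&\Rightarrow\; \mathbf{A}^{*} = \mathbf{V}\mathbf{A}, \\
\mathbf{V}^{T}\mathbf{V} = \mathbf{I}
  \;&\Rightarrow\; \mathbf{A} = \mathbf{V}^{T}\mathbf{A}^{*}, \\
\boldsymbol{\alpha}_j
  = \sum_{l=1}^{t} \boldsymbol{\nu}_l\,\alpha_{lj}^{*}
  \;&\approx\; \sum_{l=1}^{t'} \boldsymbol{\nu}_l\,\hat{\alpha}_{lj}^{*}.
\end{aligned}
```

The final approximation reflects two things: estimates replace the true effects, and only the t′ leading pseudo traits are retained, so the contribution of the discarded pseudo traits is dropped.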

Results

Simulated data

A total of 6000 SNPs with equal allele frequencies are simulated and evenly distributed across 6 chromosomes, with 1000 SNPs on each chromosome. Given a constant correlation of 0.1 between two adjacent SNPs on the same chromosome, 6000 normally distributed random variables are first generated from a multivariate normal distribution with an expectation of 0 and the given constant correlations. Then, the indicator variables zij are generated as +1 if the random variable is >0.675, as −1 if it is <−0.675 and as 0 otherwise. On each simulated chromosome, one or two SNPs (QTLs) are assumed to govern two normally distributed quantitative traits. The positions and genetic effects of the 10 QTLs across the 6 chromosomes are presented in Table 1. Residual variances for the two traits are set to 1, so that the residual covariance is equal to the residual correlation between the two traits. With this setup, the heritabilities of the 10 simulated QTLs range from 0 to 0.041 for the first trait and from 0 to 0.033 for the second trait. Phenotypic values are drawn from a bivariate normal distribution with expectation μi and residual covariance matrix Σe, where μi is calculated as the sum of the products of the simulated QTLs’ indicator variables and the corresponding genetic effects. To evaluate the influences of sample size and correlation between the two traits on mapping results, sample size is tested at two levels, 1000 and 2000, and correlation is set to one of four levels: 0, 0.2, 0.5 and 0.8.
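
The simulation design can be sketched as follows. Several details are assumptions made for illustration: the latent normal variables follow a first-order autoregressive process so that adjacent variables have correlation 0.1, only one chromosome of 100 SNPs is generated, and the QTL positions and effects are hypothetical placeholders rather than the values in Table 1.

```python
import numpy as np

def simulate_genotypes(n, m, rho=0.1, cutoff=0.675, rng=None):
    """Threshold correlated standard normal variables into -1/0/+1 genotype codes."""
    rng = np.random.default_rng() if rng is None else rng
    latent = np.empty((n, m))
    latent[:, 0] = rng.standard_normal(n)
    scale = np.sqrt(1.0 - rho**2)            # keeps every latent variable at variance 1
    for j in range(1, m):
        latent[:, j] = rho * latent[:, j - 1] + scale * rng.standard_normal(n)
    # Cutoffs of +/-0.675 are close to the quartiles of N(0, 1), giving
    # genotype frequencies near 1/4, 1/2, 1/4.
    return np.where(latent > cutoff, 1, np.where(latent < -cutoff, -1, 0))

def simulate_phenotypes(Z, effects, trait_corr, rng):
    """Draw two traits with unit residual variances and residual correlation trait_corr.

    effects maps a SNP index to its (trait 1, trait 2) genetic effects.
    """
    n = Z.shape[0]
    mu = np.zeros((n, 2))
    for j, (a1, a2) in effects.items():
        mu[:, 0] += a1 * Z[:, j]
        mu[:, 1] += a2 * Z[:, j]
    sigma_e = np.array([[1.0, trait_corr], [trait_corr, 1.0]])
    return mu + rng.multivariate_normal(np.zeros(2), sigma_e, size=n)

rng = np.random.default_rng(2024)
Z = simulate_genotypes(n=1000, m=100, rng=rng)
hypothetical_effects = {20: (0.25, 0.0), 70: (0.2, -0.2)}   # placeholder QTLs
Y = simulate_phenotypes(Z, hypothetical_effects, trait_corr=0.5, rng=rng)
print(Z.shape, Y.shape, np.mean(Z == 0))
```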

Table 1 Positions and genetic effects of the QTLs simulated

The simulated data sets are analyzed by our proposed method (Residual PCA for short), joint analysis based on phenotypic PCA (Phenotypic PCA for short) and the conventional multivariate analysis scheme (Multivariate for short). To facilitate the comparison of the three analysis methods, all test statistics are transformed to –log(p) from the associated P-values. The simulations are repeated 500 times for estimating QTL parameters and assessing the statistical power of QTL detection. At the 5% significance level, the statistical power of QTL detection for each locus is calculated as the proportion of simulations in which the test statistic exceeds the critical value of 1.313. The false positive rate is likewise evaluated from 500 simulations under the null model without genetic effects on the two traits.
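
The empirical power calculation amounts to counting exceedances of the critical value across replicates. The sketch below assumes a base-10 logarithm for the transformation and takes the critical value as an input; the replicate statistics are invented numbers used only to show the calculation.

```python
import math

def neg_log10_p(p_value):
    """Transform a P-value to the -log10 scale (base 10 is an assumption here)."""
    return -math.log10(p_value)

def detection_rate(test_statistics, critical_value):
    """Proportion of replicates whose test statistic exceeds the critical value."""
    exceed = sum(1 for s in test_statistics if s > critical_value)
    return exceed / len(test_statistics)

# Hypothetical -log(p) values for one locus across four replicates.
replicate_stats = [0.5, 1.4, 2.0, 1.0]
power = detection_rate(replicate_stats, critical_value=1.313)
print(power)
```

Applied to replicates simulated under the null model, the same function gives the empirical false positive rate.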

Table 2 shows the statistical power and false positive rate for QTL detection using the three analysis methods, and Table 3 reports the estimated QTL genetic effects when the correlation between the two quantitative traits is 0.5. The results at other correlation levels are provided in Supplementary Tables S1–S4 of the Supplementary Material. In accordance with our expectations, each analysis method gives similar statistical patterns: (1) statistical power of QTL detection and the precision of parameter estimation increase as the QTL heritability increases, (2) statistical power of QTL detection is higher and false positive rate is lower as the QTL heritability increases and (3) a larger sample size is beneficial for identifying QTL. All analysis methods are able to accurately locate the simulated QTLs, with negligible deviations in position. Across the correlations between the two traits, the Residual PCA method is basically identical to the Multivariate method in terms of statistical power and QTL parameter estimation, and both methods distinctly outperform the Phenotypic PCA method. In general, false positive rates are <10% for all scenarios. The Residual PCA and Multivariate methods deliver very similar false positive rates, which are clearly lower than those from the Phenotypic PCA method. In addition, the relative statistical performance of the three analysis methods does not appear to depend on the correlation between the two traits. Although the estimates of QTL genetic effects should theoretically be the same for the Multivariate and Residual PCA methods, minor discrepancies exist owing to the slightly different statistical powers of the two approaches.

Table 2 Statistical powers of QTL detection and false positive rates (FPR) obtained with three mapping methods for the simulated data sets with correlation 0.5
Table 3 Mean estimates and s.ds. (in parentheses) of QTL effects obtained with three mapping methods for the simulated data sets with correlation 0.5

We also record the computing time of each analysis method for each simulated data set (results not shown). Our proposed method takes almost the same computing time as the Phenotypic PCA method, whereas the Multivariate method takes about five times as long as our proposed method for a sample size of 1000. As the sample size increases to 2000, the difference in computing time between our proposed method and the Multivariate method widens further, suggesting superior computational efficiency in addition to the statistical performance of the proposed approach.

Real data

The experimental population consists of 1058 young Simmental bulls born between 2008 and 2011, which originated from Ulgai, Xilingol league, Inner Mongolia of China. After weaning, the cattle were moved to Beijing Jinweifuren cattle farm and were fattened under the same feeding and management environment. Growth and development traits for each individual were recorded between 16 and 18 months of age, before slaughter. At slaughter, carcass traits and meat quality traits were measured according to the Institutional Meat Purchase Specifications for fresh beef guidelines. The blood samples were collected along with the regular quarantine inspection of the farms, without the need for ethical approval. DNA was extracted from these blood samples using routine procedures. The Illumina BovineHD BeadChip was adopted for genotyping.

Before statistical analysis, SNPs were removed from the study if (1) their call rates were <90%, (2) their minor allele frequencies were <3%, (3) any genotype appeared in <5 individuals or (4) they departed from Hardy–Weinberg equilibrium with P-values <10^−6. In addition, individuals with >10% missing genotypes or with >2% Mendelian error rates in genotyping were excluded. Finally, a total of 986 individuals and 631 396 SNPs were retained for the multiple-trait GWAS analysis.
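
The SNP-level filters can be expressed as a small function. The following sketch encodes the thresholds stated above, but two points are assumptions: the genotype-count rule is read as requiring every genotype class to appear in at least five individuals, and Hardy–Weinberg equilibrium is tested with a one-degree-of-freedom chi-square goodness-of-fit test, which the text does not specify. Genotypes are coded −1, 0 and 1 with `None` for a missing call.

```python
import math

def snp_passes_qc(genotypes, min_call_rate=0.90, min_maf=0.03,
                  min_genotype_count=5, hwe_p_min=1e-6):
    """Return True if a SNP passes the call rate, MAF, genotype count and HWE filters."""
    called = [g for g in genotypes if g is not None]
    if not genotypes or len(called) / len(genotypes) < min_call_rate:
        return False
    n = len(called)
    n_aa, n_ab, n_bb = called.count(-1), called.count(0), called.count(1)
    freq_b = (2 * n_bb + n_ab) / (2 * n)
    if min(freq_b, 1.0 - freq_b) < min_maf:
        return False
    if min(n_aa, n_ab, n_bb) < min_genotype_count:
        return False
    # Chi-square goodness-of-fit test against Hardy-Weinberg proportions (1 df).
    expected = [n * (1 - freq_b) ** 2, 2 * n * freq_b * (1 - freq_b), n * freq_b ** 2]
    chi2 = sum((o - e) ** 2 / e for o, e in zip([n_aa, n_ab, n_bb], expected))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))   # upper tail of chi-square with 1 df
    return p_value >= hwe_p_min

# Hypothetical SNPs illustrating each filter.
in_equilibrium = [-1] * 25 + [0] * 50 + [1] * 25
low_call_rate = [-1] * 20 + [0] * 40 + [1] * 20 + [None] * 20
heterozygote_deficit = [-1] * 45 + [0] * 10 + [1] * 45
print(snp_passes_qc(in_equilibrium),
      snp_passes_qc(low_call_rate),
      snp_passes_qc(heterozygote_deficit))
```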

Among a total of 40 carcass traits and meat quality traits, 20 are chosen to demonstrate the proposed method. These analyzed traits include live weight, carcass weight, net weight of beef (boneless), net weight of beef, head weight, forehoof weight, cowhide weight, oxtail weight, flank weight, ribeye weight, high rib weight, tenderloin weight, shin weight, shoulder weight, topside weight, silverside weight, top round weight, rump weight, shank weight and hoof weight. Phenotypic correlations among these traits, listed in Supplementary Table S5, are >0.40.

Environmental factors, such as measuring year and slaughtering age (in months), are included in the genetic model, and population stratification is taken into account as well. In the shrinkage estimation of the genetic model for each trait, the number of cross-validation folds is set between 3 and 10 to make sure that each trait retains at least one non-zero genetic effect after shrinkage. Pseudo traits in a lower-dimensional space are then obtained by performing PCA on the residual covariance matrix, as discussed in the ‘Method’ section. The first two pseudo traits, which together explain >85% of the variation in the residual covariance matrix, are analyzed.

At the significance level of 0.05, 27 significant SNPs are identified as QTLs for the first two pseudo traits. For clarity in tabulating the mapping results, we report in Table 4 only the 14 of these 27 SNPs that remain significant at the more stringent level of 0.001. As can be seen from Table 4, the heritabilities of these detected QTLs are overall very low for the two pseudo traits, ranging from 0.00 to 0.13. The genetic effects of these detected QTLs are transformed to those for the 20 original traits using the eigenvectors corresponding to each pseudo principal component. The results provided in Supplementary Table S6 of the Supplementary Material show that many genetic effects are small or even negligible. However, the absolute values of genetic effects cannot precisely reflect the impact of the detected SNPs on any original trait, as heritability also depends on each trait’s phenotypic variation. In fact, the heritabilities of the detected QTLs for the 20 analyzed traits can be calculated from the estimated genetic effects and the estimated residual variances, where the latter can be obtained from diag (VTΛV) for the original traits. It can be seen from Table 5 that, in general, the thirteenth and fourteenth QTLs have a higher genetic influence on the analyzed traits than the other detected QTLs. Further, the heritability of a QTL can also be used to indicate the extent to which pleiotropy occurs.

Table 4 The detected SNPs for the first two pseudo principal components (SPC) of 20 carcass traits and meat quality traits in beef cattle
Table 5 Estimated heritabilities of the detected QTLs for 20 carcass traits and meat quality traits in beef cattle

Discussion

In the conventional phenotypic PCA for analyzing multiple traits, the phenotypic covariance matrix Σp is first decomposed into $\mathbf{V}_p^{T}\boldsymbol{\Lambda}_p\mathbf{V}_p$, and then the phenotypes of multiple traits are orthogonally transformed to independent principal components through the eigenvector matrix Vp. As Vp is an orthogonal matrix, the relationship between phenotypes (yi) and principal components (CPi) can be described as CPi=Vpyi and $\mathbf{y}_i = \mathbf{V}_p^{T}\mathbf{CP}_i$. Substituting into likelihood function (4) gives

$$
L = \prod_{i=1}^{n} (2\pi)^{-t/2}\,|\boldsymbol{\Sigma}_e|^{-1/2} \exp\left\{-\tfrac{1}{2}(\mathbf{CP}_i - \mathbf{V}_p\boldsymbol{\mu}_i)^{T}\,\mathbf{V}_p\boldsymbol{\Sigma}_e^{-1}\mathbf{V}_p^{T}\,(\mathbf{CP}_i - \mathbf{V}_p\boldsymbol{\mu}_i)\right\} \qquad (14)
$$

Obviously, this likelihood function cannot be solved sequentially for the principal components, because $\mathbf{V}_p\boldsymbol{\Sigma}_e^{-1}\mathbf{V}_p^{T}$ is a non-diagonal matrix. But if Σe=Σp, then $\mathbf{V}_p\boldsymbol{\Sigma}_e^{-1}\mathbf{V}_p^{T} = \boldsymbol{\Lambda}_p^{-1}$, with Λp being a diagonal matrix consisting of eigenvalues. This assumption, however, holds only in the case of no pleiotropic or closely linked QTLs for multiple traits. In contrast, our proposed method based on the PCA of the residual covariance matrix is more general: it factorizes the likelihood function for multiple traits into separate likelihoods for the pseudo principal components, or pseudo traits. As a result, univariate analyses for pseudo traits give parameter estimates equivalent to those from the joint multivariate analysis under a linear transformation.
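
A small numerical example illustrates the contrast. It assumes two traits with unit residual variances, a single hypothetical pleiotropic QTL whose effects are chosen arbitrarily, and a genotype variance of 0.5 (the value for −1/0/1 coding with equal allele frequencies); none of these numbers come from the study.

```python
import numpy as np

def eigvec_rows(sigma):
    """Return V with eigenvectors as rows, so sigma = V.T @ diag(eigenvalues) @ V."""
    _, U = np.linalg.eigh(sigma)
    return U.T

sigma_e = np.array([[1.0, 0.5],
                    [0.5, 1.0]])             # residual covariance matrix
a = np.array([1.0, -0.5])                    # hypothetical pleiotropic QTL effects
var_z = 0.5                                  # genotype variance under -1/0/1 coding
sigma_p = sigma_e + var_z * np.outer(a, a)   # phenotypic covariance matrix

V_p = eigvec_rows(sigma_p)                   # eigenvectors used by phenotypic PCA
V_e = eigvec_rows(sigma_e)                   # eigenvectors used by residual PCA
inv_e = np.linalg.inv(sigma_e)

M_p = V_p @ inv_e @ V_p.T                    # matrix in the exponent after phenotypic PCA
M_e = V_e @ inv_e @ V_e.T                    # same matrix after residual PCA

print(np.round(M_p, 3))                      # off-diagonal terms remain
print(np.round(M_e, 3))                      # diagonal
```

With the phenotypic eigenvectors the cross term between components survives, so the components cannot be analyzed one at a time without loss; with the residual eigenvectors it vanishes.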

The key to implementing the proposed method is the estimation of the unknown residual covariance matrix. Because univariate analyses and the joint analysis give equivalent maximum likelihood estimates for model (1) when the same genetic model is used for each trait, the residual covariance matrix in this study is estimated through the maximum likelihood estimation of the genetic model for each trait. Note that the LASSO procedure (Yuan and Lin, 2005; Friedman et al., 2010) for estimating the sparse oversaturated genetic model for each trait yields biased non-zero genetic effects because of the penalty, and these biased estimates in turn bias the estimate of the residual covariance matrix. However, starting from this biased estimate, the residual covariance matrix can be estimated iteratively along with all other genetic effects. This iterative process can be carried out from step (2) to step (5) in the outlined algorithm. We investigated the performance of this iteration scheme using the simulated data sets (results not shown) and found that fewer than five iterations are needed for convergence and that the mapping results are basically the same as those without iteration.

To detect genetic variations associated with beef carcass traits and meat quality traits, GWAS have been conducted in Korean Hanwoo cattle (Lee et al., 2010), Korean beef cattle (Kim et al., 2011) and Australian taurine and indicine cattle (Bolormaa et al., 2011). Many significant SNPs were identified using simple linear regression and stepwise regression procedures. Bolormaa et al. (2010) carried out a multiple-trait GWAS for dairy traits using a PCA and a series of bivariate analyses. That study showed that multiple-trait GWAS has better statistical power to detect associations than single-trait GWAS and identifies additional associations without an increased false discovery rate. However, it did not increase the precision of the mapped QTL. Until now, no GWAS based on PCA has been reported for multiple beef carcass traits and meat quality traits, and <50 000 SNPs were used in the previous GWAS in cattle. With a total of about 630 000 SNPs in our study, more biologically important SNPs are expected to be identified. This will largely improve our knowledge of the genetic architecture of beef traits and provide a valuable research tool for analyzing multiple traits in other GWAS.

Data archiving

Data available from the Dryad Digital Repository: 10.5061/dryad.mh77c.