ABSTRACT
BACKGROUND Metafounders are pseudo-individuals that condense the genetic heterozygosity and relationships within and across base pedigree populations, i.e. ancestral populations. This work addresses estimation and usefulness of metafounder relationships in Single Step GBLUP.
RESULTS We show that the ancestral relationship parameters are proportional to standardized covariances of base allelic frequencies across populations, like Fst fixation indexes. These covariances of base allelic frequencies can be estimated from marker genotypes of related recent individuals, and pedigree. Simple methods for estimation include naïve computation of allele frequencies from marker genotypes or a method of moments equating average pedigree-based and marker-based relationships. Complex methods include generalized least squares or maximum likelihood based on pedigree relationships. To our knowledge, methods to infer Fst coefficients and Fst differentiation have not been developed for related populations.
A compatible genomic relationship matrix constructed as a crossproduct of {−1,0,1} codes, and equivalent (up to scale factors) to an identity by state relationship matrix at the markers, is derived. Using a simulation with a single population under selection, in which only males and youngest animals were genotyped, we observed that generalized least squares or maximum likelihood gave accurate and unbiased estimates of the ancestral relationship parameter (true value: 0.40) whereas the other two (naïve and method of moments) were biased (estimates of 0.43 and 0.35). We also observed that genomic evaluation by Single Step GBLUP using metafounders was less biased in terms of accurate genetic trend (0.01 instead of 0.12 bias), slightly overdispersed (0.94 instead of 0.99) and as accurate (0.74) than the regular Single Step GBLUP. Single Step GBLUP using metafounders also provided consistent estimates of heritability.
CONCLUSIONS Estimation of metafounder relationship can be achieved using BLUP-like methods with pedigree and markers. Inclusion of metafounder relationships improves bias of genomic predictions with no loss in accuracy.
BACGROUND
The concept of metafounders gives a coherent framework for a comprehensive theory of genomic evaluation [1]. Genomic evaluation in agricultural species often implies partially genotyped populations, i.e. some individuals are genotyped, others are not, and phenotypes may be recorded in either of the two subsets. An integrated solution called Single Step has been proposed [2–4]. This solution proposes an integrated relationship matrix
with inverse
where G is the genomic relationship matrix, A is the pedigree-based relationship matrix, and matrices A11, A2, A21, A22 are submatrices of A with labels 1 and 2 denoting non-genotyped and genotyped individuals, respectively.
Because genotyped animals are not a random sample from the analyzed populations (they are younger or selected), it was quickly acknowledged that a proper analysis requires specifying different means for genotyped and non-genotyped individuals for the trait under consideration. These different means can be considered as parameters of the model, which are either fixed [4] or random [5,6]. In the latter case, the random variables induce covariances across individuals, a situation that is referred to as “compatibility” of genomic and pedigree relationships. In fact, compatibility implies comparability of the average breeding value of the base population and of the genetic variance [7] across the different measures of relationships.
Numerically, the problem shows up as follows. The formulae for matrix H and its inverse contain (G − A22) and (assuming G is full rank), respectively. This suggests that if G and A22 are too different, biases may appear.
Genomic relationships are usually computed in one of two manners: the “crossproducts” [8] or the “corrected identity by state (IBS)” [9]. Both depend critically on assumed base allelic frequencies (Toro et al., 2011). However, for most purposes allelic frequencies are not of interest per se and can be treated as nuisance parameters to be marginalized. Christensen [10] achieved an algebraic integration of allele frequencies, leading to a very simple covariance structure with allele frequencies in genomic relationships fixed at 0.5 (e.g., using genotypes coded as {−1,0,1} in the crossproducts) and a parameter called γ which describes the relationships across founders i.e. in the base population. A second parameter in Christensen’s marginalisation is s, which is a counterpart of the heterozygosity of the markers at the base population. Therefore, instead of inferring (thousands of) base allelic frequencies, inference can be based on two simple parameters γ and s. Both can be estimated maximizing the likelihood of observed genotypes. Also this considers the fact that pedigree depth is arbitrary and mostly based on historical availability of records.
Legarra et al. [1] showed the equivalence of Christensen’s ideas to metafounders: pseudo-individuals that simultaneously consider three ideas: (a) separate means for each base population [4,11], (b) randomness of these separate means [5] and (c) the propagation of the randomness of these means to the progeny [10], while accommodating several populations with complex crosses e.g. [12]. Legarra et al. [1] also generalized one relationship across founders (scalar γ) to several relationships across founders in the pedigree, i.e. ancestral relationships (matrix Γ), and suggested simple methods to estimate them. However, the performance of their model, both for estimation of ancestral relationships and for genomic evaluation, has not been tested so far.
This work has two objectives. The first one is to delve into the structure of the metafounder approach to find an alternative parameterization and estimation of the ancestral relationships. By doing so we find that ancestral relationships are generalizations of Wright’s Fst fixation index. The second goal is to test, by simulation, (i) methods to estimate ancestral relationship parameters, (ii) the quality of genomic predictions using metafounders and (iii) the quality of variance component estimation. For the second goal, the simulated population is undergoing selection and with a complete pedigree partially genotyped.
METHODS
Relationship between metafounders and allelic frequencies at the base
Single population
Let M be a matrix of genotypes coded as gene content, i.e. {0,1,2} and the genomic relationship matrix G = (M − J)(M − J)’/s with J a matrix of 1’s, with reference alleles taken at random so that for a random locus the expected allelic frequency p is 0.5. [10]. In other words, the matrix Z = (M − J) contains values of {−1,0,1} for each genotype. In a single population, let γ be a relationship coefficient across pedigree founders or, equivalently the self-relationship of the metafounder [1,10]. Parameter γ is the relationship coefficient among the founders of a population, so that G = (M − J)(M − J)’/s is most likely given the observed pedigree. This relationship γ is relative to a population with maximum heterozygosity and it is analogous to an Fst fixation index. The parameter s is a measure of maximum heterozygosity in the population.
Christensen (2012) estimated the two parameters, γ and s using maximum likelihood, whereas Legarra et al. (2015) suggested methods of moments. Closer inspection of Appendix A in Christensen (2012) leads to the following developments (see supplementary material for more details).
The parameter γ is such that with pi = 1 − qi the allelic frequency at a random locus i. The parameter s = n(2Var(pi) + E(2piqi)) with n being the number of markers. However, E(2piqi) = 2E(pi)E(qi) − 2Var(pi) = 0.5 − 2Var(pi), where it was used that if alleles are labelled at random across loci then E(pi) = E(qi) = 0.5. From this it follows that
and the genomic relationship matrix is G = 2(M − J)(M − J)’/n. Interestingly, this matrix is similar to a matrix of IBS relationships, that can be written as GIBS = (M − J)(M − J)’/n + 11′, so that
. (See proof in the supplementary Material).
so that γ for a single population is eight times the variance of allelic frequencies at the base population (this variance was described by Cockerham [13]). These equalities were not described in Christensen [10]. We stress that
to imply that
(and γ) is a parameter, the variance of allelic frequencies [10,14–16]. On the other hand, s can be seen as the heterozygosity in the case that all markers had an allelic frequency of 0.5.
Multiple populations
In an analogous manner, the relationship across two metafounders b and b′ is
i.e., the covariance across loci between allelic frequencies of two populations b and b′. This is almost tautological: the relationship is the covariance across gene contents at a locus, here applied for populations. Christensen et al. (2015) show this in Appendix A, somehow implicitly. Cockerham [13] and Robertson [17] interpret
as the coancestry across two populations and Fariello et al. [18] use
, to describe the divergence of populations. There are several measures of genetic distance between populations (e.g. [19]), and most of them contain a term related, implicitly or explicitly, to
,. In particular, the average square of the Euclidean distance can be written as
.
Estimation
Estimation in a single population
Estimation of s is trivial, it is simply half the number of markers. Parameter γ is proportional to the variance of allele frequencies. If base population individuals were genotyped, computing allele frequencies and estimating γ is trivial. In the next section we propose methods when this is not the case, i.e. genotyped individuals are related and perhaps several generations away from the base.
1-Assuming no pedigree structure
NAIVE: The simplest model assumes that genotyped individuals are unrelated and constitute the base population. For locus i, let mi be a vector of gene contents in the form {0,1,2}, defined as before. The mean of this vector is μi = 2pi. For each locus, estimate μi as the observed mean of mi, then compute as the empirical variance across loci of
, and because
and
.
2-Considering pedigree structure
At locus i, gene content can be seen as a quantitative trait where the mean of mi in the base population is 2pi, where pi is the allelic frequency at the base population, and the genetic variance is 2piqi [20]. Cockerham (1969) showed that the covariance of gene content of marker i across individuals j and k is a function of relationship Cov(mi,j, mi,k) = Ajk2piqi. A linear model can therefore be written as:
where W is an incidence matrix relating individuals in pedigree to genotypes, and with ui being the deviation of each individual from the mean μi for all individuals (Gengler et al., 2007; Forneris et al., 2015). Assuming multivariate normality:
Equivalently, for the set of genotyped individuals (labelled as “2”), u2,i~ N(0, A22(2piqi)) where A22 = WAW′ is an additive relationship matrix spanning only the genotyped individuals. From this formulation, there are two possible strategies to estimate
.
Generalized Least Squares (GLS). This ignores the prior distribution of μ and estimates each μi as a “fixed effect” using for each locus separate BLUP (or, equivalently, GLS) estimators of μi. One option is to use the complete A−1 and mixed model equations [20,21]. Equivalently, the corresponding GLS expression is
where (
) is the sum of elements of
and
is simply a weighted sum of genotypes. Then, estimate
as
and because
, and it follows that
.
Maximum likelihood (ML). Actually (and more exactly), μi can be considered as drawn from a normal distribution, . Thus
is a variance component that can be estimated by Maximum Likelihood. The equations for given values of
and
are
. An Expectation-Maximization scheme [22] is as follows. Pick starting values for
. Iterate until convergence on:
For each marker i,
estimate
store
update
as
with
Update
as
, where the second part of the expression corresponds to the trace Tr(IC), I, the identity matrix, is the relationship across μ and C is the prediction error covariance matrix of
. As only the diagonal elements of C are needed, the elements
can be obtained separately from each single locus analysis.
On convergence, the estimate is . This gives the same estimate as the method based on a Wishart likelihood function in Christensen (2012) with s = n/2 (results not shown).
Estimation in multiple populations
If t base populations are considered, the variance component generalizes to Σ0, a t×t matrix of variances and covariances across means
for marker i in population b. Across different populations,
.
1-Assuming no pedigree structure
NAIVE If relationships across individuals are ignored:
where Q is a matrix allocating individuals to populations and μi is a vector with t elements including each population average. For each locus, μi can be computed using least squares and the covariance matrix of μi across loci gives an estimate of
.
2-Considering pedigree structure
If there are no crosses, the estimation of allelic frequencies can be split in separate analysis by population with
, and Ab is the matrix of relationships concerning population b. Then,
is estimated as the observed matrix of covariances across loci for estimated
. If there are crosses, there are two alternatives.
GENERALIZED LEAST SQUARES (GLS). The first alternative, suggested by Forneris et al. (2015) is to use a genetic groups model [11,23], as mi = Qμi + Wui + e where Qk,b contains the fraction of ancestry b in individual k. This ignores the fact that the variance of gene content, (2piqi) is different for each breed and cross. The second, and more exact alternative is to use the representation where the breeding values are split into within and across breed components (Garcia-Cortes and Toro, 2006), as
with partial relationship matrices for vectors ub, ub,b′.
MAXIMUM LIKELIHOOD (ML). Analogously to the single population case, an Expectation-Maximization updated estimate can be obtained using multiple trait formulations [22] where PEC is the prediction error variance-covariance, e.g. with two populations:
Our current implementation is as follows:
For each marker i,
estimate
store
update
as
with
Update Σ0 using crossproducts within and across populations as e.g. with two populations,
There is an approximation in (1c) because we assume that is equal across all base populations. This point will be addressed in future research.
SIMULATION
To assess the quality of genomic predictions using one metafounder, we simulated data using QMSim [24]. The simulation closely followed Vitezica et al. (2011) to mimic a dairy cattle selection scheme scenario. A historical population undergoing mutation and drift was generated, followed by a recent population undergoing selection.
First, 100 generations of the historical population were generated with an effective population size of 100 during the first 95 generations, followed by a gradual expansion during the last 5 generations to an effective population size of 3000. In total 30 chromosomes of 100 cM and 40,000 segregating biallelic markers distributed at random along the chromosomes in the first generation of the historical population were simulated. The 40,000 markers were resampled from a larger set of 90,000 markers in order to obtain allelic frequencies from a beta(2,2) distribution, similar to dairy cattle marker data, so that true γ had a value around 0.40. Potentially, 1500 QTL affected the phenotype; QTL allele effects were sampled from a Gamma distribution with a shape parameter of 0.4. The mutation rate of the markers (recurrent mutation process) and QTL was assumed to be 2.5 × 10−5 per locus per generation (Solberg et al., 2008). A female trait with a heritability of 0.30 was simulated.
Then, 10 overlapping generations of selection followed, where 200 males were mated with 2600 females producing 2600 offspring following a positive assortative mating design. Within the simulation, individuals were selected according to estimated breeding value (EBV) based on pedigree BLUP. In each generation 40% of the males and 20% of the females were replaced by younger and selected individuals. No restrictions were set to avoid or minimize inbreeding, so highly inbred individuals were found, as a result of extreme selection and matings among highly related individuals. There were 100 individuals (mainly found in the last generation) with an inbreeding coefficient higher than 0.20, with extreme cases (few individuals) with inbreeding coefficients higher than 0.40. True breeding values (TBV) and pedigree information were available for all 10 generations (28,800 individuals in pedigree), phenotypes were available for all females except the last generation (14,300 records). All males (840 sires of females with phenotypic records) were genotyped as well as 2600 individuals in generation 9 (with records) and 2600 in generation 10 (with no records). All in all, 20 independent replicates were made. A two-step analysis was carried out using the simulated data. First, we compared several methods to estimate γ. Then, we tested the quality of genomic predictions using four methods, one of them including one metafounder.
Methods to estimate Gamma
Parameter γ was estimated using four different estimation methods. First, the NAIVE method which does not consider the pedigree structure. Then, the genealogical information was included in the estimation by three different methods: GLS, ML, and the Method of Moments (MM) presented in Legarra et al. (2015). For a single population, the last method involves the estimation of γ based on summary statistics of A22 (regular pedigree-relationship matrix for genotyped individuals) and G (the genomic relationship matrix).
Genomic prediction methods
Genetic merit of the selection candidates in generation 10 (genotyped and with no phenotype records) was estimated using four methods. The first one was the pedigree based BLUP (PBLUP) based on phenotype and pedigree information. The second method was Single-Step GBLUP (SSGBLUP) in which genomic information is also taken into account; this method used the correction of [25] and is the default method used in most practical applications [25,26]. However, the implementation that we used does not include inbreeding in the setup of A−1 [27], although it does consider it in (see below for use of these matrices). The third method was Single-Step GBLUP including inbreeding in the setup of A−1 and of
(SSGBLUP_F). Finally, the fourth method was SSGBLUP including the metafounder (SSGBLUP_M), using γ estimated by GLS as it turned out to be an accurate method to estimate gamma (see the Results section). The three methods used the following inverse relationship matrices: PBLUP: A−1; SSGBLUP:
where Ga is as in [25]; SSGBLUP_M:
where G = (M − J)(M − J)’/s with s = n/2 (see the Methods section) and A(γ) is as in [1]. More details are given in Supplementary material. For computation we used blupf90 [28]. In the case of SSGBLUP_M we constructed all relationship matrices with own software, and then used the option user_file in blupf90.
Quality of genomic prediction
Prediction quality was checked for all 2600 selection candidates. The accuracy of the methods was measured as the Pearson correlation between TBV and EBV. Bias was calculated as the difference between the average TBV and average EBV with respect to the base population. Thus, bias is related to estimated genetic progress in the selection candidates. The inflation (often called bias) of the prediction method was quantified by the coefficient of regression of TBV on EBV. These two statistics corresponds to the coefficients b0 and b1 in the Interbull validation method [29] which uses the regression TSV = b0 + b1ESV + e. The mean square error (MSE) was calculated as the mean of the squared difference between TBV and EBV. An ideal method should have maximum accuracy, minimum MSE, zero bias and a regression coefficient of 1. These are not only nice statistical properties but also have relevance in livestock selection [30–32]. Ranking changes of the selection candidates were also assessed by calculating the Spearman’s rank correlation coefficients between EBVs across methods.
In addition, the quality of variance component estimation was also assessed. For this purpose variance components were estimated using the four methods (PBLUP, SSGBLUP, SSGBLUP_F, SSGBLUP_M) using REML with remlf90 [28].
RESULTS
Estimation of gamma
Figure 1 shows boxplots of the differences between the estimates of γ calculated by four different methods (MM, Naive, ML and GLS) and the true values obtained by simulation, using each of the 20 replicates. The simulations were tailored to produce γ = 0.40. ML and GLS estimated γ very accurately. The MM clearly underestimated the value of γ, whereas the Naive method overestimated it. Based on these results we used the γ estimated by GLS when using SSGBLUP_M for prediction. The effect of employing different values of γ in the genomic prediction was assessed to quantify its impact in terms of the quality of predictions. Using estimates of γ based on the Method of Moments only slightly changed the results (not shown).
Differences between estimated and true Gamma, across 20 simulation replicates. Gamma was estimated by Generalized Least Squares (GLS), Maximum Likelihood (ML), Method of Moments (MM) and the Naive method.
Quality of genomic prediction
Correlations between TBV and EBV for each of the prediction methods are shown in Table 1 and Figure 2a. Compared with PBLUP, SSGBLUP_F and SSGBLUP_M increased accuracy by approximately 23 absolute points, respectively. This shows an important improvement by including marker information in the prediction and the possibility of generating a small extra gain when also including the metafounder. SSGBLUP resulted in a small loss of accuracy as compared to SSGBLUP_F and SSGBLUP_M.
Accuracy (correlation between TBV and EBV), inflation (regression coefficient of TBV on EBV), bias (average (EBV-TBV)) and mean square error (MSE) for each of the prediction methods. Standard deviations in parenthesis.
a. Correlation of TBV on EVB for each prediction method (accuracy). b. Regression slope of TBV on EBV (overdispersion). c. Bias (average (EBV-TBV)).
Bias values for each prediction method are shown in Table 1 and in Figure 2c. Both PBLUP and SSGBLUP_M were unbiased, whereas SSGBLUP and SSGBLUP_F were biased. Bias in SSGBLUP_F is equivalent to roughly 0.5 generations of genetic improvement or to 0.4 standard genetic deviations.
Table 1 and Figure 2b display the regression coefficient of TBV on EBV. This value measures the inflation degree of each prediction method and should be close to 1. PBLUP and SSGBLUP_F produced the values closest to one. Including genomic data in the prediction using SSGBLUP resulted in regression coefficients lower than one, but including the metafounder in SSGBLUP_M gives values closer to one. SSGBLUP_M and SSGBLUP_F displayed a lower standard deviation compared to the other two methods. Again, SSGBLUP showed the highest variability. SSGBLUP_M displayed the lowest MSE (closer to zero), followed by SSGBLUP_F (Table 1).
Variance components estimation
Figure 3 shows the estimates of heritability obtained in three of the four methods assed (PBLUP, SSGBLUP_F and SSGBLUP_M). The estimates obtained using SSGBLUP are not displayed in Figure 3 because in 6 out of the 20 simulation replicates EM-REML did not converge. Convergence was achieved in those cases by weighting the submatrix in H−1 by ω = 0.7 instead of 1 [33] but poor quality estimates were obtained and they are not reported.
Estimated heritability for PBLUP, SSGBLUP_F and SSGBLUP_M considering the 20 replicates. The dotted line shows the simulated heritability of 0.30.
When comparing with the simulated true heritability value (0.30) the scenarios displayed in general lower estimates. The lowest estimates were obtained using SSGBLUP_F. Including the metafounder improved estimates compared to SSGBLUP_F and reduced variability when comparing to PBLUP.
Ranking
The methods were also compared based on ranking correlations of EBVs with TBV and across methods. A rank correlation of 1 implies that the same candidates are selected. Results are in Table 2. Rank correlations with TBV are similar to accuracies in Table 1. Selection decisions are only slightly different using SSGBLUP, SSGBLUP_F or SSGBLUP_M. Note however, that this table does not address the comparison across generations (e.g. old vs. young animals), which is sensitive to biases reflected in Table 1 [32].
Spearman correlation among TBV and the four EBV for each of the prediction methods. Standard deviations in parenthesis.
DISCUSSION
In this work, we have addressed the complex issue of conciliation of marker and pedigree information. Powell et al. [34] argued that both IBS (at the markers) and IBD are measures of identity at causal genes and they are compatible notions. However, the incompatibility issue appears when mixing both kind of relationships [5,25,35,36]. Legarra [7] established how to solve the issue of comparing genetic variance across IBD, IBS or other measures of relationships. In this work, we have used, similar (but not identical) to Powell et al. [34], a fixed reference (G constructed as a crossproduct of {−1,0,1} genotypic codes) and tailored A (IBD, pedigree) to fit G (IBS, markers). Using a fixed reference has the advantage, compared to previous approaches, that genomic relationships are immutable (adding more genotypes to the database does not change the existing relationships) and they are unconditional on pedigree depth, that by construction is always limited and, in animal breeding, often heterogeneous. Our approach is in fact very similar to considering, as measures of identity, plain IBS. We use a matrix G = 2(M − J)(M − J)’/n, whereas a matrix of IBS, or molecular, relationships is GIBS = G/2 + 11′ (see proof at the supplementary material). In a GBLUP context when all animals are genotyped, using a model with IBS coefficients yields identical results as the term 1/2 gets absorbed into the variance component and the constant 11′ gets absorbed into the fixed part of the linear mixed model [7,37]. However, the matrix that must be used in SSGBLUP_M is G and not GIBS, because GIBS is not compatible with pedigree relationships.
Easy estimation of ancestral relationships
The derivations in the THEORY section show that estimation of ancestral relationships in γ (one base population) and Γ (several base populations) may be framed within the linear model approach that is classical in quantitative genetics [13], and recently used for gene content [12,20,21]. These methods are easy to understand and to compute. Also, Γ can be understood, just like heritability, as an unobserved base population parameter that does not change with additional data (although its estimate may change). Therefore, an accurate estimate of Γ can be used repeatedly without the need of re-estimation, as is customary in livestock genetic evaluations. This contrasts with “centering” of marker covariates, which changes with every new genotype.
In the current research, the simplest methods (Naive and Method of Moments) yielded biased (upwards and downwards respectively) estimates of γ; for the first method because it ignores that allele frequencies drift to the extremes as generations go, and for the second because it implicitly assumes that individuals genotyped are a random sample from a particular generation when in fact they are not.
In addition, the equivalence of ancestral relationships with second moments of allele frequencies shows a strong relation with populations genetics theory, which will be detailed in the next paragraph.
Relationship between metafounders γ and Fst fixation index
The fixation index Fst [38] is a measure of diversity of a set of populations with respect to a reference population, usually the pool of all populations. In this view, each population is a random sample from all possible populations that could be sampled according to the evolutionary process described by Fst. Conceptually, Fst is a parameter to be estimated [13,39], and it is not a statistic computed from the data. A usual definition of Fst for a particular biallelic locus is
where
is the variance of allelic frequencies across populations and
is the allelic frequency of the conceptual combined population. If we consider that the variance of allelic frequencies applies across loci and not across populations, it follows naturally that
. In this case,
Our interpretation is as follows. Jacquard (1974) called
the “inbreeding coefficient of a population”. Cockerham (1969) modelled
as an intraclass correlation, “the coancestry of the line with itself”, in other words, the probability that two gametes taken at random from the line are identical. Thus, it makes perfect sense to consider that the additive relationship (which is twice the coancestry value) of a group with itself is
. This is the interpretation of the
coefficient in Legarra et al. [1]. Note that the assumption
is automatically fulfilled if reference alleles are labelled randomly across loci (i.e., they are neither the most frequent nor the least observed).
Alternatively, Legarra et al. (2015) showed that for a population with self-relationship of γ, the average heterozygosity was , i.e. the variance is reduced by an amount of θ from the conceptual population with heterozygosity 1. Thus
can be interpreted as Fst if the latter is taken as a measure of homozygosity.
Consequences of using metafounders in genomic evaluation
Genomic estimates of breeding values are invariant to allele coding [37] when all individuals are genotyped. However, this is not the case when pedigree and marker information are combined as in SSGBLUP. In this work we have shown that, even in presence of complete pedigree and a single base population, use of metafounders in SSGBLUP_M leads to slightly more inflated, less biased EBVs, lower MSE and nearly unbiased estimates of heritability compared to SSGBLUP_F. Bias, defined as E(EBV-TBV)), is typically overlooked in genomic predictions, but in an example of biased evaluation “sires of later generations appeared to be under-evaluated relative to older sires” [40]. Overdispersion, also called bias in recent literature (e.g. Mantyssari et al., 2010), may have dramatical impact as well [30–32]. The trade-off between bias and variance needs further studies. For instance, [5] found that SSGBLUP_F was unbiased but had some overdispersion; this is likely dependent on the data structure, including the genotyping.
In addition, use of metafounders allows a clear definition of genomic relationships. With this definition, relationships are not dependent on pedigree depth or completeness, and are not dependent on allelic frequencies subject to change with arrival of new data. Additionally, a high dimensional parameter (-base- allele frequencies) is substituted by a low-dimensional one (matrix Γ).
The poor performance of SSGBLUP as compared to SSGBLUP_F (the former ignoring inbreeding in the set up of A−1) is likely due to the presence of highly inbred individuals. This relates to the interpretation of an ω parameter used in early studies of SSGBLUP. An application of SSGBLUP for type traits in Holstein [33] experienced convergence problems. The authors found that by multiplying by a ω = 0.7 eliminated convergence problems and increased accuracy. However, the nature of that parameter was not known, e.g. Misztal et al. [41]. In those studies, the inverse of the numerator relationship matrix A−1 was constructed using Henderson’s rules while ignoring inbreeding [27], while the submatrix
included inbreeding. Subsequently, the elements in the latter were too large. In addition, genotyped animals were on average unrelated in G but not in A22, which is corrected by scaling G as in Vitezica et al. (2011). But then, in
the elements were too large for younger animals relative to G. Both problems are partially circumvented but putting a weight ω < 1 on
When A−1 was constructed considering inbreeding, the optimal ω coefficient in an analysis of Holstein dairy cattle increased from 0.7 to 0.9 (Masuda, personal communication, 2016). However, the metafounder approach provides a clean solution to this problem. Also, following these experiences, A−1 should always be constructed considering inbreeding to avoid pathological problems.
CONCLUSION
Metafounders are similar to Fst fixation indices and proportional to covariances of allelic frequencies in base populations. Use of metafounders is simplified by new methods (GLS and maximum likelihood) to estimate the covariance of base allele frequencies. We verified by simulation of a selected population that, in a single population, both GLS and ML are unbiased and computationally efficient. In the same simulation, use of metafounders in Single Step GBLUP leads to more accurate and less biased evaluations, and also to more accurate estimates of genetic parameters.
We propose a genomic relationship matrix that refers to a population with ideal frequencies 0.5. This matrix is similar to an IBS relationship matrix (up to scale factors), does not change with new data and is compatible with pedigree data if metafounders are used.
In this simulated data, pedigrees are perfectly known. Future work with real data sets in more complex settings - purebreds and their crosses [42,43], and selected populations with unknown parent groups [11] will investigate the feasibility and accuracy in practice of using metafounders on Single Step GBLUP.
DECLARATIONS
Availability of data and materials: Software and files are avaible at https://github.com/alegarra/metafounders.
Competing interests
The authors declare that they have no competing interests
Funding
CAGB, SML and RJCC were partially funded by grants FONCyT PICT 2013-1661, UBACyT 861/2011 and PIP CONICET 833/2013. This work was partially financed by the AdMixSel project of the INRA SELGEN metaprogram (CAGB, AL and ZGV) as well as INP Toulouse (CAGB, AL). The project was partly supported by the Toulouse Midi-Pyrenees Bioinformatics platform.
Authors contribution
AL and OFC derived the theory with help from ZGV and CAGB. All authors agreed on scenarios to be tested. CAGB programmed and run all the simulations, with substantial input from IP and IM. The initial version of the manuscript was written by CAGB and AL and then completed by all authors.
Acknowledgments
We thank S Boitard and B Servin for discussions concerning Fst and all members of AdMixSel project.
APPENDIX
This Appendix contains several algebraic developments not detailed in the main text.
Analytical derivation of γ and s
For a particular population, the genetic variance-covariance structure is a function of two parameters η1 and and s = n(2η1 + η2) (n being the number of markers) which depend on the allelic frequencies (Christensen 2012), Appendix A. With pj being the allelic frequencies across the j = 1. n loci, these parameters do not depend on j and are equal to
with q = 1 − p.
Now use is made of the following developments.
Since we have that Var(p) = E(p2) − E(p)2 we obtain E(p2) = Var(p) + E(p)2. We also have E(q) = 1 − E(p). Substituting E(p)2 in (A1) gives
If markers are biallelic and labelled at random E(p) = E(q) = 0.5. So the equation above gives E(pq) = 0.25 − Var(p). From this we obtain
and therefore
or, in other words, s is half the number of markers. Further,
so that γ for a single population is eight times the variance of allelic frequencies at the base population.
Equivalences of genomic relationship matrices
The matrix G described in Christensen (2012) and in this paper can be written as , where M contains genotypes coded as {0,1,2} and J is a matrix of 1’s. The purpose of this paragraph is to show the linear relationship of this matrix with a matrix describing identity by state coefficients (IBS), in fact
. The terms in GIBS are usually described in terms of identities or countings (i.e. Ritland, 1996; Toro et al., 2011; Nejati-Javaremi et al., 1997):
where Ikl measures the identity (with value 1 or 0) of allele k in individual l with allele l in individual j, and single-locus identity measures are averaged across n loci. There is an algebraic expression for this “counting”. Toro et al. (2011) expression (1), show that for biallelic markers, for a locus k (omitted in the notation for clarity):
for coancestry (half relationship)
of individuals i and j, where m/2 is the “gene frequency” of the individual (half m the gene content, i.e. {0,1/2,1} for the three genotypes).
In order to prove , first we translate the Toro et al. (2011) equation to the more familiar scale of relationships
and gene contents m. Thus
This expression can be easily verified in a table with the nine possible genotypes:
Also,
which extends to all individuals and averaged across loci can be written as
Thus, matrix and because
it follows that
. The equivalence can also be verified by noting that, for all nine genotypes, the cross-product (mi − 1) in the following table is identical to
in the previous table.
Computation of the different H matrices
For SSGBLUP and SSGBLUP_F, matrix H−1 is constructed as follows:
with
, and
as in VanRaden (2008), M contains genotypes coded as {0,1,2} and P contains twice allelic frequencies pi. These are computed from the observed genotypes so that 2pi is equal to the the mean of the i-th column of M. Constants a and b are such that the full-matrix and diagonal averages of Ga and A22 are the same (Christensen et al., 2012) in order to make the two matrices compatible. The use of the weights 0.95 and 0.05 is in order to make Ga invertible. Matrix A−1 should be constructed using contributions with values described in the Table below (i.e. Meuwissen and Luo, 1992):
Or, in a more compact way with Funknown = −1.
SSGBLUP uses the defaults in blupf90 suite of programs (random_typeadd_animal). SSGBLUP uses the simple method to create
A−1, method which pretends that in all cases inbreeding in expressions above is F = 0.
SSGBLUP_F uses H−1 as above but constructs A−1 correctly (blupf90 random_type add_an_upginb), using the rules above.
SSGBLUP_M uses the blupf90 random_type user_file to consider the following relationship matrix:
with
(basically to make G invertible),
and s = n/2, M contains genotypes coded as {0,1,2}, n is the number of markers, A(Г)−1 and
are constructed with own programs as in Legarra et al. (2015) using the estimated value of Γ. Inbreeding is fully considered in both matrices.