Abstract
We developed an integrated R library called BWGS to enable easy computation of Genomic Estimates of Breeding values (GEBV) for genomic selection. BWGS relies on existing R-libraries, all freely available from CRAN servers. The two main functions enable to run 1) replicated random cross validations within a training set of genotyped and phenotyped lines and 2) GEBV prediction, for a set of genotyped-only lines. Options are available for 1) missing data imputation, 2) markers and training set selection and 3) genomic prediction with 15 different methods, either parametric or semi-parametric.
The usefulness and efficiency of BWGS are illustrated using a population of wheat lines from a real breeding programme. Adjusted yield data from historical trials (highly unbalanced design) were used for testing the options of BWGS. On the whole, 760 candidate lines with adjusted phenotypes and genotypes for 47 839 robust SNP were used. With a simple desktop computer, we obtained results which compared with previously published results on wheat genomic selection. As predicted by the theory, factors that are most influencing predictive ability, for a given trait of moderate heritability, are the size of the training population and a minimum number of markers for capturing every QTL information. Missing data up to 40%, if randomly distributed, do not degrade predictive ability once imputed, and up to 80% randomly distributed missing data are still acceptable once imputed with Expectation-Maximization method of package rrBLUP. It is worth noticing that selecting markers that are most associated to the trait do improve predictive ability, compared with the whole set of markers, but only when marker selection is made on the whole population. When marker selection is made only on the sampled training set, this advantage nearly disappeared, since it was clearly due to overfitting. Few differences are observed between the 15 prediction models with this dataset. Although non-parametric methods that are supposed to capture non-additive effects have slightly better predictive accuracy, differences remain small. Finally, the GEBV from the 15 prediction models are all highly correlated to each other. These results are encouraging for an efficient use of genomic selection in applied breeding programmes and BWGS is a simple and powerful toolbox to apply in breeding programmes or training activities.
Introduction
The use of molecular markers to provide selection criteria for quantitative traits was first proposed by [1]. They introduced a theory for optimizing weights to be given to each marker associated with a QTL, and they demonstrated that this index was at least as efficient as the phenotypic score for the genetic improvement of a population by truncation selection. This marker-assisted-selection (MAS) approach used only markers which had been previously associated with a QTL through statistical analysis. Therefore, the number of markers remained limited, and their effects were usually estimated by solving the linear model equations. The efficiency of MAS versus phenotypic selection is higher when the trait has a low heritability, the population size is large and the QTLs explain a large proportion of the trait variation. It was shown that the use of this marker index would facilitate early selection, bypassing a trait evaluation step and thereby shortening selection cycles [2]. Consequently the genetic gain per cycle would increase. However, QTL detection is often limited in practice by experimental power, particularly by the size of the studied population, and QTL with small effects often remain undetected. Therefore, this “missing heritability” can reduce the efficiency of MAS. Subsequent studies have shown that MAS efficiency is improved when more QTLs with small effects are included [3–4], and this implied relaxing the stringency threshold of significance to allow more true QTLs being detected, despite the risk of having some false positive ones being included. Extending this reasoning, it was proposed to include all markers in the selection index, thus bypassing the QTL identification step [5]. As the number of markers is generally larger than the number of phenotypic observations, classical fixed effects regression are intractable (n>>p problem). Therefore, [5] suggested using ridge regression models to overcome this over-parameterization problem. Soon after, [6] applied ridge and Bayesian regression models to animal populations for predicting breeding values, and called this approach Genomic Selection (GS). Marker effects are first estimated from the genotypic and phenotypic data in a training population. Then marker effects are used to predict breeding values in the target population with only genotypic data, and selections are based on these Genomic Estimates of Breeding Values (GEBV). This method has been used successfully for dairy cow breeding [7]. Indeed, in the case of dairy cow breeding (and particularly for bulls), the advantages of GS over classical breeding are obvious. 1) genotyping is much cheaper (a few 10s US$ for the 55K SNP bovine chip) than progeny testing (since bulls do not give milk), 2) genotyping can be done at birth time, while progeny testing requires >7-8 years (until bulls give daughters and daughters give milk). Therefore GS of dairy bulls allowed early selection on a larger population, thus leading to nearly doubling the genetic gain per unit of time while the costs of proving bulls were reduced by 92% [8].
To benefit from shorter cycles and increasing selection intensity, the genetic gain per cycle should be close to that expected from phenotypic selection (PS). This relative efficiency of GS vs PS thus relies on the ability of predicting the observed genetic value from the marker genotype. GEBV predictive ability is usually measured by the correlation between the predicted and observed values. Most reported studies on GS in plants have focused on measuring the accuracy of genomic predictions, usually assessed by cross-validation techniques [9–14].
In the native theory, all available markers should be used without prior selection, since the statistical models were supposed to cope with the big data problem and avoid over parametrization. However, with the advance of molecular biology, the number of markers, e.g. single nucleotide polymorphisms (SNP), can be extremely high (up to some millions), particularly when compared to a few hundreds or thousands phenotypic observations. In this case, it seems reasonable to limit the number of markers to avoid too high over parametrization or over-representation of non-informative genomic regions, as well as speeding up computation. For example, one may wish to discard markers when they are in complete (or nearly complete) LD with another, thus bringing the very same information, or selecting markers which are evenly spaced, either physically or genetically, along the genome map.
Molecular data often contain missing value, particularly with the so-called genotyping-by-sequencing (GBS), likely because the fraction of the genome which is re-sequenced is not exactly the same from one individual to another [15]. Most prediction models do not accept missing data, therefore an imputation step is necessary to replace missing values, and various methods have been proposed to achieve the best guess (e.g. fastphase [16]).
Finally, imputed data of (possibly selected) markers are used to predict GEBV using phenotypic observation in a training population. Several methods have been proposed to achieve GEBV prediction. They can be classified into parametric vs semi-parametric methods [17]. In the R environment [18], which is often used in research, several libraries have been specifically developed for genomic selection, such as BGLR [19] or rr-BLUP [20]. However, few of these packages proposes functions to successively achieve the three described steps of marker selection, genotype imputation and model prediction.
The objectives of this manuscript are 1) to describe an integrated software (pipeline) which has been developed from open-source R functions available in various R libraries to enable the three steps to be performed easily and 2) to present an application of this software to carry out genomic predictions using historical data from a bread wheat breeding programme.
Materials and Methods
The BWGS pipeline
In the framework of the French flagship programme BreedWheat (www.breedwheat.fr), we developed an integrated pipeline based on R [18] called BWGS (BreedWheat Genomic Selection pipeline). BWGS comprises three modules: 1) missing data imputation, 2) dimension reduction, for reducing the number of markers and/or training individuals and 3) Genomic Estimation of Breeding Values (GEBV) with a choice among 15 parametric and non-parametric methods.
The pipeline comprises two “main” functions (Fig 1)
The first function called bwgs.cv is using both genotyping and phenotyping data from a “training” set or reference population to carry on model calibration and cross validation. Data are randomly split into n “folds”, and n-1 folds are used for training models and predicting the nth one. Computation can be replicated p times, and correlation between GEBV and observed trait are computed for each fold and each replicate, enabling estimates of average and standard deviation of predictive ability (see [21]).
Once the “best” model has been chosen based on quality assessment, a second function, named bwgs.predict is used to build the BEST model using the whole training set (genotyping + phenotyping), then apply the model to the genotyping data of the target population to get GEBV of these new genotypes.
Candidate lines of the target population can then be ranked according to GEBV for single trait (truncation) selection.
Going into more details of the pipeline, the workflow comprises three main steps:
A step of (missing) genotyping data imputation. This option can be useful for sources of genotyping data such as GBS. The following options are available:
MNI: missing data are replaced by the mean allele frequency of the given marker. This imputation method is only suited when there are a few missing values, typically in marker data from SNP chips or KasPAR.
EMI: missing data are replaced using an expectation-maximization methods described in function A.mat of R-package rrBLUP [20]. This algorithm was specially designed by Poland et al (2012) for the use of GBS markers, which usually give many missing data which are roughly evenly distributed. However, it does not use physical map position, as do other more sophisticated software (e.g. Beagles, [22]). For imputing low density genotyping of a large population to high density available for only a subpopulation, i.e. many markers with many missing data, such software should be used before BWGS.
A step of dimension reduction, i.e. reducing the number of markers. This reduction could be necessary to speed up computation on large datasets, depending on computer resources available. The following methods are available
RMR: Random sampling (without replacement) of a subset of markers. To be used with the parameter “reduct.marker.size”.
LD (with r2 and MAP): enables “pruning” of markers which are in LD > r2. Only the marker with the least missing values is kept for each pair in LD>r2. To allow faster computation, r2 is estimated chromosome by chromosome, so a MAP file is required with information of marker assignation to chromosomes.
ANO (with pval): one-way ANOVA are carried out with R function lm on trait “pheno” Every markers are tested one at a time, and only markers with pvalue<pval are kept for GEBV prediction
ANO+LD (with pval and r2, MAP is facultative): combines a first step of marker selection with ANO, then a second step of pruning using LD option.
For research or teaching purposes, an option for randomly sampling individuals has been added, although it is little useful in practical breeding applications. Options for selecting a subset of the training population are:
RANDOM: a subset of sample.pop.size is randomly selected for training the model, and the unselected part of the population is used for validation. The process is repeated nFolds * nTimes to have the same number of replicates than with cross-validation.
OPTI: the optimization algorithm based on CDmean [23] to select a subset which maximizes average CD (coefficient of determination) in the validation set. Since the process is long and has some stochastic components, it is repeated only nTimes.
A step of model building and cross validation.
In the general case of genomic selection, the number of explanatory variables, i.e. markers, (largely) exceeds the number of observations, making the classical linear model equation unsolvable. In a review, classified most of the methods that have been proposed to overcome this “big data” problem, into penalized regression (to make them solvable) or semi-parametric methods. Moreover, regression can be solved either analytically as in ridge regression (equivalent to G-BLUP) or iteratively though Bayesian computations. Bayesian methods can differ by the prior density distribution of marker effects, which can be modified boundlessly. In their review [24] describe the main features (e.g. prior…) for 13 methods.
The options available for genomic breeding value prediction are:
GBLUP: performs G-BLUP using a marker-based relationship matrix, implemented through rrBLUP R-library. Equivalent to ridge regression (RR-BLUP) of marker effects.
EGBLUP: performs EG-BLUP, i.e. BLUP using a “squared” relationship matrix to model epistatic 2×2 interactions, as described by [25] using the BGLR library
RR: ridge regression, using package glmnet [26]. In theory, strictly equivalent to GBLUP.
LASSO: Least Absolute Shrinkage and Selection Operator is another penalized regression methods which yield more shrinked estimates than RR. Run by glmnet library.
EN: Elastic Net [27] which is a weighted combination of RR and LASSO, using glmnet library
Several Bayesian methods, using the BGLR library
BRR: Bayesian ridge regression: same as rr-blup, but Bayesian resolution. Induces homogeneous shrinkage of all markers effects towards zero with Gaussian distribution [24].
BL: Bayesian LASSO: uses an exponential prior on marker variances priors, leading to double exponential distribution of marker effects [28].
BA: Bayes A uses a scaled-t prior distribution of marker effects [6].
BB: Bayes B, uses a mixture of distribution with a point mass at zero and with a slab of non-zero marker effects with a scaled-t distribution [29].
BC: Bayes C same as Bayes B with Gaussian a distribution for non-zero marker effects[19] A more detailed description of these methods can be found in (http://genomics.cimmyt.org/BGLR-extdoc.pdf.)
Four semi-parametric methods
RKHS: reproductive kernel Hilbert space and multiple kernel MRKHS, using BGLR [30–31]. Based on genetic distance and a kernel function to regulate the distribution of marker effects. This methods is claimed to be effective for detecting non additive effects.
RF: Random forest regression, using randomForest library [32]. This method uses regression models on tree nodes which are rooted in bootstrapping data. Supposed to be able to capture interactions between markers.
SVM: support vector machine, run by e1071 library. For details, see LIBSVM: a library for Support Vector Machines https://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf
BRNN: Bayesian Regularization for feed-forward Neural Network, with the R-package BRNN [33]. To keep computing time in reasonable limits, the parameters for the brnn function are neurons=2 and epochs = 20.
The criteria used to estimate model’s quality are 1) the Pearson correlation between adjusted phenotype and GEBV, computed over the whole data set (i.e. by merging the value from the nFolds) and 2) the square-root of the mean-squared error of prediction RMSEP. These criteria are provided for each replicate (nTimes) as well as mean and standard deviation over replicates. A table summarizes phenotype, estimated breeding value and its standard deviation, as well as the coefficient of determination for each individual GEBV computed as CD=sqrt(1-(SD(GEBV)2/VAR(GEBV)), when the package used does provide estimate of variance and standard deviation of GEBV, as it is the case for BGLR.
Application to real wheat breeding: plant materials and phenotypic data
A training population was built by gathering historical data from the INRA-Agri-Obtentions breeding programmes. To obtain robust phenotypes, we kept data from lines which have been evaluated in multisite trials (> 6 locations), usually for 1-2 years. They were developed from inbred lines crosses, followed by 6-8 generations of self-pollination. Few selection is made during the first 3 generations (i.e. F2 to F4), which are harvested in bulk, then visual selection on simple traits is made on plants or rows at generations F4 and F5. Selected F5 families are multiplied in field plots, which allow a rough evaluation of yield. Then enough seeds are available and selected lines are put into the evaluation network, usually at the F7 generation, when breeding lines are nearly homozygous. Typically, about 300-400 pair crosses are made each year, 120-150 F7 lines are put into first year multisite trials, then 50-60 are evaluated for a second year, and the very happy few (1-3) are then entering the official registration trials in France, to hopefully become a commercial variety, 10-12 years after the initial crosses. On the whole, our database for field data had 77,176 records on 1715 genotypes, i.e. 45 single plot measurements per genotype, spread over 15 years (2000-2014), 10 locations in France and two managements (high vs low inputs). Of course these data are highly unbalanced, each genotype being evaluated, on average, during 1.5 years in 7.60 environments (site x year) with usually two replicates by management, and most connections in the design are between two successive years, with a few control varieties (usually four in each trial) being evaluated on a longer period. These figures are for grain yield, the most documented trait. The whole dataset is described in [34] and can be recovered at urgi.versailles.inra.fr (INRA Small Grain Cereals Network Phenotypic Trials dataset). To illustrate the use of BWGS, we will concentrate on grain yield in high input management, i.e. an estimate of the genetic and climatic potential in each environment.
These highly unbalanced raw data require a pre-treatment step to correct for non-genetic factors as much as possible. The following mixed model was applied to estimate “corrected” genotypic means (BLUE) using the mixed model in R (library lme4):
Y=lmer (yield∼geno + (1|year:site:trial:block) + (1|year:site:geno)), where : stands for nested effects, e.g. block within trial within year.
Where geno being the main genotypic effect (fixed) and all other effects being considered as random effects to be corrected for. Note that these genotypic “BLUE” are highly correlated (r=0.94) with BLUP estimates (i.e. when genotypes are also considered as random effects with identity matrix for modelling covariances). But BLUE are not shrinked, and thus can be more easily used in a second step involving mixed modelling.
The same model with geno as random effect was used to estimate variance components and rough estimate of broad sense heritability as σ ²g / (σ ²g + σ ²ge /nenv + σ²e /nplot) σ
Where σ²g is the genotypic variance component, σ²ge is the GxE variance component (i.e. year:site:geno) and σ²g is the residual variance. nenv is the average number of environments per genotype and nplot the average number of plots per genotype in the dataset used.
Out of these “HQ-phenotyped” lines, 760 were also genotyped with the TaBW280K SNP chip [35] and used for testing BWGS. After quality check, a matrix of 188,406 high quality, polymorphic markers was finally used. To illustrate BWGS pipeline, we used a subset of 60,912 markers with consolidated map position.
Cross Validation used randomly sampled 90% of the genotypes as the training set and the remaining 10% genotypes as the “validation set”. The resampling process was iterated 100 times to estimate an empirical mean and standard deviation. Since the true breeding value was unknown, predictive ability (PA) was measured by the Pearson correlation between GEBVs and the adjusted phenotypic values (BLUE) across all folds, i.e. the Hold accuracy, which is supposed to be less biased that a fold-by-fold estimate [36].
Results
Summary statistics of the breeding population
Genotypic data have been used to estimate the additive relationships matrix according to [37] with the A.mat function of the R package rrblup. After scaling on a 0-1 scale (1 on diagonal), the values can be regarded as estimates ofcoancestry coefficients, whose distribution is shown in Fig 2a. Omitting the diagonal, coanscestry coefficients range between 0.05 and 0.5, with a median around 0.18, i.e. slightly less than the value expected for half-sibs. The heat map of Fig 2b displays no clear-cut structure into distinct groups, although some groups along the diagonal are made of lines that are more related than average. Such absence of structure allows the use of random sampling cross-validation. A plot of principal component analysis of Euclidian distance matrix among the 760 lines (Fig 2c, R command cmdscale) again does not show any clear structure according to the year of first evaluation, thereby justifying again the use of random cross-validation over years.
Imputation of missing data
The first step of data cleaning with MAF=5% and maximum marker missing data = 20% led to retain 47,839 markers out of 60,912. Since adjusted Yield BLUE were complete for the 760 genotypes, the final dimension of geno matrix was 760 x 47,839. This SNP dataset contained on average 0.93% of missing data. Given the small proportion of missing data, no difference in predictive ability was observed between the two imputation methods. To test the efficiency of imputation methods on prediction accuracy, we generated new genotypic matrices with 20%, 40%; 60% and 80% of missing data, randomly distributed in the dataset. Imputation by mean allele frequency took 0.15 minute, while imputation by the EM algorithm of A.mat took around 15 minutes, whatever the proportion of missing data. Predictive ability for grain yield from imputed dataset using the two methods is presented in Fig 3. The two methods give similar results for up to 20% missing Data, then imputation by EMI allows higher predictive abilities than imputation by the mean allele frequency. When 80% missing data are randomly generated, the predictive ability of EMI-imputed set is still 0.418, while it drops to 0.329 with the MNI-imputed set.
Effects of sampling markers
Table 1 shows the number of markers selected by either one-way ANOVA at different pvalue thresholds, or by pruning markers with LD > threshold values from 0.5 to 0.98. For ANOVA selection, two strategies have been used, namely 1) GWAS was carried out only once using the whole dataset, i.e. training + validation lines and 2) GWAS was carried out using training lines only, i.e within each time x fold replicate. Predictive abilities achieved with the selected marker subsets are also shown in Table 1.
Results are also displayed in Fig 4.
As expected, predictive ability increases with the number of randomly sampled markers, with a steady plateau above around 5000 markers and a best predictive ability of 0.525. This clearly shows that the extent of LD is large enough in such elite breeding lines to allow most QTLs for yield being captured with 5000-10000 markers, i.e. on average one every 1.7 Mbase. This is illustrated by marker selection based on LD-pruning, i.e. removing markers which are in LD > threshold with any other marker. The marker with the least number of missing value is conserved, otherwise the choice is random. The highest value of 0.544 is obtained with 4462 markers with pairwise LD >0.8. When marker selection is based on ANOVA carried out on the entire population, there is an optimum predictive ability of 0.614 achieved with 1820 markers (pvalue = 0.0001). However, these results were only found when marker selection is made only once by association analysis on the whole dataset, which is then split into a training and a validation set. When marker selection is made within the loop only in the training set, then the predictive ability of the validation set is only slightly improved for N= 2000 and 5000 markers, compared to random sampling, but never exceeds the predictive ability achieved with the whole set of markers. This clearly illustrates that overfitting does occur in the first case.
Effect of different calibration set sampling strategies on prediction accuracy
For this study, we used the whole set of 47,839 markers; which gave a predictive ability of 0.525 in 10-fold cross validation using the complete dataset of 760 breeding lines, i.e. with a training set of 684 lines and a validation set of 76 lines. Predictive ability was estimated with GBLUP using calibration set of size CS= 50, 100, 200, 300, 400, 500, 600 and 700, validation being made using the remaining lines. Fig 5 shows the increase of both predictive ability and computing time as the size of training set increases. Computing time is given as the proportion of that needed for 100 replicates x 10 folds with TS = 700, which was roughly 60 minutes. As expected, predictive ability does increase with the size of calibration set, while its variability decreases, with the notable exception of TS = 700. This may be due to the small size of the validation set, which is only 76 with TS = 700, making correlations between phenotype and GEBV more erratic than those observed with larger validation sets.
Two strategies were compared as illustrated in Fig 6: 1) the random sampling among the 760 lines as in Fig 5 and 2) selecting an optimized subset with the CD-mean criteria as described by [23]. Although this algorithm should be deterministic and give always the same subset, some stochasticity remains in the drop-replacement procedures, which explains that predictive ability still have some residual variation as illustrated by the error-bar. Fig 6 shows that the optimization algorithm does improve predictive ability for small-medium size of the training set compared to random sampling. as predicted by the theory and already reported [23]. This advantage disappears when the proportion of sampled individuals increases.
Efficiency of various prediction methods
With again the whole set of 47,839 markers, we used all prediction methods available in BWGS to estimate prediction accuracy in 100 independent 10-fold cross validations. Computing time varied considerably from one method to another, as illustrated in Fig7. BRNN is, by far, the most demanding methods, with 2214 minutes (nearly 4 days) needed to carry out 100 replicates of 10-folds cross validation, while the least demanding is GBLUP with 28 minutes. Of course these values must be taken only for comparison, as they are highly dependent on the computer characteristics. Note that EGBLUP, which is an extension of GBLUP with epistatic interactions being modelled by the product of additive relationship matrix with itself, takes 344 minutes instead of 28. The other computationally intensive methods are multi-kernel RKHS (637 minutes) and random Forest RF (382 minutes).
In Fig 7, we discarded results from SVM methods, which gave very poor predictive ability (0.25), although in a reasonable time of 31 minutes. Predictive abilities of the other 14 methods range from 0.475 to 0.543, with no relationship with computing time, which can be considered as an estimate of method complexity. Note that the support vector machine SVM compared with other methods when 5000 random markers are used, but seemed to be unable to deal with 47,000 markers. LASSO and elastic net (from glmnet library) and BRNN give the worse predictive abilities around 0.48, while random forest regression RF gives the highest of 0.543, slightly above that of the reference GBLUP (0.525). Note that the three methods that outperform GBLUP are thought to take into account non additive marker effects. However, in this practical case of grain yield prediction, they did not show a dramatic advantage over GBLUP. BRNN is another model-free, machine learning method, often supposed to give more accurate prediction than linear regression methods. The relatively poor predictive ability observed in this study can be caused by insufficient computer resources allocated. Other parameterization (e.g. number of neurone layers, epochs…) may have given better predictive abilities.
Perhaps as important as predictive ability is the consistency of GEBV estimated by different methods. Fig 8 shows the correlation between GEBV (averaged over the 100 replicates) predicted by the 14 successful methods (omitting SVM). The minimum correlation of 0.85 is between random Forest and Bayesian ridge regression, while many pairwise correlation are close to 1. When omitting RF, which is the method whose prediction are least related to the others, all correlations are above 0.92, thus all methods can be considered as giving highly consistent prediction of GEBV.
Discussion
Genomic selection programmes are now routinely used in dairy cow breeding, and benefit from huge phenotypic data recorded in past years on milk production of thousands of females, usually related by well-known pedigree relationships (i.e. mother and father known without error). This is however not yet the case in other species like sheep, although effort are being developed in so-called minor species. Therefore, using a relatively cheap genotyping, dairy cow breeders usually have very large population for training GS models, which led to highly accurate predictions. In many animal studies, the oldest animals are used as training set and the youngest as validation set.
Contrasting with animal breeding, most plant breeding programmes do not have very large population sizes, although each breeding company manipulates hundreds or thousands of candidates, since companies are usually reluctant to share and merge datasets. Breeding companies have developed in-house biostatistical tools for calculating GEBV with semi-automated pipeline, since time is often short between data production (e.g. grain harvest) and selection decision (e.g. sowing next generation). There are also several publically available tools which have been developed, particularly as R libraries. Among the most popular we can mention glmnet [26], BGLR [19], rrBLUP [20] or Synbreed [38]. Recently, integrative packages have been developed, which rely on existing public R-libraries. An example of such packages are G2P and the one presented here called BWGS. G2P [39] proposes 17 prediction methods from 10 R-libraries, among which BGLR, glmnet and randomForest as in BWGS. As BWGS, there are also two main functions, G2PCrossValidation and G2P to apply prediction model to a test set with only marker data. Compared to BWGS, G2P offers much more options for tuning parameters of the numerous functions/libraries called by the two main functions. As a drawback, handling G2P appears more complex, and its complete use requires reading the notices of original libraries, since the notice of G2P does not provide enough details on the possible parameters and options. Moreover, although G2P does content a quality control module, it is not directly integrated into the main function as in BWGS. Comparatively, in BWGS, we have chosen to fix most internal parameters with defaults values, which have been tested to be adapted for medium size datasets as provided in the example (47 K SNP, 760 training lines), while maintaining computing time into reasonable limits for desktop computers. It is therefore easy to use, especially for beginners, and of course parameters can be modified quite easily in the source code to adapt larger datasets.
In our highly unbalanced breeder’s dataset, when discarding control lines which were regularly replicated over years, the average number of environment per studied lines is 10.4, with two replicates by environment (site x year). This led to an estimate of broad sense heritability of 0.76 and therefore a theoretical upward limit of prediction accuracy of 0.872. When using random cross validation and the whole set of 47,839 markers, the achieved predictive ability is 0.525, which would correspond to an accuracy of 0.603 according to formula in [21]. However we do not fully trust in this formula, since heritabilities are often poorly estimated. We do prefer keeping predictive ability as a criteria for comparing models and strategies.
Our results compare with previously reported predictive ability estimated through random cross-validation, for example 0.36-0.53 [40] and 0.32-0.59 [41] for grain yield in bread wheat with training population of a few hundreds lines, and up to 0.65 with a training set of 2325 elite European wheat lines [42]. The use of random cross validation seems to be justified, since no clear-cut structure appears in the set of lines. In particular, lines put into trial in a given year are not more related with each other than with lines put into trails another year. Then the predictive ability obtained in random cross-validation should be valid for any other set of lines showing a similar degree of relatedness than within the training set.
Effect of marker density and training population size
Although GS theory has been elaborated to cope with the over parameterization problem (number of markers >> number of observations), it empirically appears that adding more markers than needed does not improve predictive ability. This was already observed in many reports. Among other, a figure similar to our Figure 3 can be found in [43]. In this empirical study in maize, GS accuracy reach a plateau with 7000 randomly selected markers in a “natural” population, and with only 2000 markers in biparental populations. In a recent simulation study with high density coverage, the same authors even stated that the accuracy obtained using all SNPs can be easily achieved using only 0.5 to 1.0% of all markers [44]. This clearly illustrates that, once every QTL information is captured by one marker in LD, adding more markers is useless. This of course relates to the average linkage disequilibrium between adjacent markers. In our study, the material is made of hundreds of related families, each of small size, and the plateau is reached around 2000-5000 random markers, a value close to that observed for maize natural population, while [45] reported that 256 markers were enough to achieve maximum accuracy in wheat bi-parental populations. Similarly, in a population of 235 soybean varieties, predictive ability did not change much, whatever the number of markers (ranging 200-5200) and the way they were selected, either at random or one per haplotype block [46]. In a study of wheat breeding lines [47] found a plateau for predictive ability of yield around 2000 random markers as in the present study. Avoiding selection of markers pairs which are in high LD (LD-pruning) further improves predictive ability compare to random sampling. This was reported in a soybean study, in which the authors found a 4% increase of prediction accuracy when selecting markers from haplotype blocks rather than random or equidistant.
Selecting markers that are significantly associated with QTLs can achieve higher predictive ability than randomly selected markers, and surprisingly even higher predictive ability than using all markers. However, it is clear from Fig 4 that selecting markers from their Pvalue in GWAS carried out on the whole population; i.e. including validation lines, does led to overfitting and this approach must be avoided and cannot be used in practice. This was reported many times. For example, in a wheat study [47] selected markers by GWAS on the training set only, as we did also in Fig 4. However, they observed a gain in predictive ability of up to 0.2, particularly for very small number of markers (<100), while we only had small improvement of about 0.01 with a maximum gain for 2000 markers. Nearly similar results were reported by [48].
In theory, the prediction accuracy is positively related to the training population size, as established by simulation studies [49–51]. This was confirmed in many empirical studies such as those already mentioned [42, 46, 47]. As expected from the theory, optimizing a subset of training lines gave higher PA than random selection. This was confirmed in empirical studies [23] then [52–53] in the case of population structure. Their optimization algorithm uses a simulated annealing approach. Other optimization methods have been proposed, such as a genetic algorithm [54] and also led to higher PA for a given training size.
Effect of prediction method
Prediction methods have been classified into parametric and non-parametric (or semi-parametic) methods, and parametric methods sometimes split into penalized approach and Bayesian approach (for reviews see [17] and [55–57]). In our study, most methods gave close values of predictive abilities, ranging 0.475 −0.543, with the noticeable exception of SVM, which worked with up to 5000 markers (data not shown), but failed with the 47 K markers. Some methods seem to give poorer results, such as LASSO and EN (elastic net) from the glmnet library, and also BRNN. In this latter case it may be due to insufficient computer resources, who led us to use restrictive parameters (e.g. number of iterations, number of neurone layers…), thereby limiting performances of this highly demanding method. In any case predictive ability is not related to computing time, with the less demanding GBLUP ranking the fifth best method.
Although there is no clear-cut separation between parametric vs non-parametric methods, nor between penalized vs Bayesian approach, it seems that methods that are supposed to better capture non additive and/or non-linear effects such as EGBLUP or RKHS gave higher PA, as already reported [33,42,48,56]. Low difference in predictive ability has already been reported by [39] who analysed three wheat populations for GEBV using 10 statistical models, with RKHS being the most accurate and Support Vector Machine the least accurate methods, as we also found in our study.
In a simulation study, [58] showed that with 121 markers with additive effects, RKHS and radial basis neural network (close to our BRNN option) clearly out passed the linear Bayesian LASSO, but it was no longer the case when adding the 7260 interactions between markers. The authors stated that “adding non-signal predictors can adversely affect the predictive accuracy of the non-linear regression models”. Other studies have shown the interest of machine learning methods for genomic prediction [59–60]. In our study, this was not the case for RKHS, but for SVM and BRNN. It may have been better to choose the set of LD-pruning selected markers to optimize predictive ability of these methods.
But perhaps more important than the predictive ability is to know whether the different methods give similar values or at least similar rankings of GEBV for candidates. Indeed correlations between predicted GEBV with the 15 methods range from 0.85 to nearly 1. The highest values being observed between linear parametric methods, whatever based on penalized regression or Bayesian approaches. Machine learning methods such as SVM and BRNN are least related to others, except with MKRKHS. Given the close values of both predictive abilities and GEBV estimates among methods, it seems reasonable to keep the historical GBLUP as a reference method, for its simplicity and fastness, at least for polygenic traits such as grain yield chosen to illustrate this study.
Conclusion
The R pipeline we have developed is based on publically available libraries and therefore offers a full freedom to operate. It is easy to handle and allows a wide range of options for missing data imputation, marker or training set selection and prediction methods. Its parameterization was fixed for medium sized datasets to make it easy to use for beginners or teaching. Applying this tool with defaults parameters to a set of elite breeding lines with historical data from yield trials allowed us to obtain similar results to those reported on other wheat populations. The options for subsampling markers and/or training set enabled us to illustrate theoretical expectations (e.g. [61]). The predictive abilities obtained on this population of limited size are encouraging for the success of genomic selection in applied wheat breeding. Of course, BWGS does not deal with all challenges. It is now admitted that most methods give reasonably high accuracy, although recent studies claim that prediction accuracy could be improved with new alliances to share data across breeding programmes [62]. Rather, the challenges for future wheat breeding are 1) efficient implementation in real breeding schemes and/or adapting selection schemes with step(s) of GS (e.g.[63–64]), 2) prediction of GxE (e.g. [65]for review) and 3) incorporating multitrait selection (e.g. [66]).
Future developments of BWGS are ongoing to address these challenges, particularly multitrait selection.
The source code of BWGS R functions as well as example files and notice are available on https://forgemia.inra.fr/umr-gdec/bwgs
Acknowledgements
This work was supported by the BreedWheat project thanks to funding from the French Government managed by the National Research Agency (ANR) in the framework of Investments for the Future (ANR-10-BTBR-03), France AgriMer and the French Fund to support Plant Breeding (FSOV).