Harnessing multivariate, penalized regression methods for genomic prediction and QTL detection to cope with climate change affecting grapevine

Viticulture has to cope with climate change and decrease pesticide inputs, while maintaining yield and wine quality. Breeding is a potential key to meet this challenge, and genomic prediction is a promising tool to accelerate breeding programs, multivariate methods being potentially more accurate than univariate ones. Moreover, some prediction methods also provide marker selection, thus allowing quantitative trait loci (QTLs) detection and allowing the identification of positional candidate genes. We applied several methods, interval mapping as well as univariate and multivariate penalized regression, in a bi-parental grapevine progeny, in order to compare their ability to predict genotypic values and detect QTLs. We used a new denser genetic map, simulated two traits under four QTL configurations, and re-analyzed 14 traits measured in semi-controlled conditions under different watering conditions. Using simulations, we recommend the penalized regression method Elastic Net (EN) as a default for genomic prediction, and controlling the marginal False Discovery Rate on EN selected markers to prioritize the QTLs. Indeed, penalized methods were more powerful than interval mapping for QTL detection across various genetic architectures. Multivariate prediction did not perform better than its univariate counterpart, despite strong genetic correlation between traits. Using experimental data, penalized regression methods proved as very efficient for intra-population prediction whatever the genetic architecture of the trait, with accuracies reaching 0.68. These methods applied on the denser map found new QTLs controlling traits linked to drought tolerance and provided relevant candidate genes. These methods can be applied to other traits and species.


81
Iwata (2013); Guo et al. (2014) compared univariate vs multi-82 variate models' performance. They found a slight advantage of 83 multivariate analysis when heritability was low and data were 84 missing. Predictive ability was particularly improved if the test 85 set had been phenotyped for one trait while prediction was ap-86 plied to another correlated trait (trait-assisted prediction) as in 87 Thompson and Meyer (1986); Jia and Jannink (2012) Liu et al. (2020). 89 However, this breaks independence between the training and 90 the test set, leading to an over-optimistic prediction accuracy 91 (Runcie and Cheng 2019). Multivariate methods have also been 92 proposed for QTL detection by Jiang and Zeng (1995); Korol et al. 93 (1995); Meuwissen and Goddard (2004), notably for distinguish-94 ing between linkage and pleiotropy when a QTL is common 95 to several traits. Some methods of multivariate penalized re-96 gression, such as in Chiquet et al. (2017), were designed to be 97 useful for both QTL detection and prediction of genotypic value. 98 Multivariate GP methods are expected to perform better if traits 99 are genetically correlated but this remains worth testing with 100 additional data. We also hypothesize that these methods will 101 have higher power for QTL detection, by making a better use of 102 information on the genetic architecture of several intertwined 103 traits.

104
Methods designed for QTL detection are rarely used for geno-105 typic value prediction. As they select only the largest QTLs, we 106 hypothesize that these methods will provide an accurate pre-107 diction as long as the genetic architecture is simple, but yield 108 poor prediction performance otherwise, as concluded in several 109 studies (Heffner et al. 2011;Wang et al. 2014;Arruda et al. 2016). 110 Conversely, some methods for GP like the LASSO and its exten-111 sions are also able to select markers with non-null effects, hence 112 to perform QTL detection. Their accuracy in detecting QTLs has where ρ B is the genetic correlation among traits and σ 2 B 1 and 199 σ 2 B 2 the genetic variances for both traits y 1 and y 2 . In the same 200 way, E ∼ MV(0, I, V E ), with the k × k error variance-covariance where ρ E is the residual 202 error correlation among traits, and σ 2 E the error variance. We set 203 ρ B to 0.8, σ 2 B 1 and σ 2 B 2 to 0.1, ρ E to 0 and narrow-sense heritability 204 to 0.1, 0.2, 0.4 or 0.8 and σ 2 E was deduced.

205
To explore different genetic architectures, we simulated s = 206 2 or s = 50 additive QTLs, located at s SNP markers, so that all 207 corresponding additive allelic effects had non-zero values in B.

208
Since all allelic effects were drawn from the same distribution, 209 all QTLs had "major" or "minor" effects for s = 2 and s = 50, 210 respectively. All dominant allelic effects were set to zero. Two 211 QTL distributions across traits were also simulated. For the first 212 one, called "same", all QTLs were at the same markers for both 213 traits. For the second one, called "diff", the two traits had no 214 QTL in common. Thus, there was genetic correlation among 215 traits only for the "same" QTL distribution.

216
For each configuration (2 or 50 QTLs combined with "same" 217 or "diff" distribution), the simulation procedure was replicated 218 t = 10 times, each with a different seed for the pseudo-random 219 number generator, resulting in different QTL positions and ef-220 fects.

221
In a first simulation set, narrow-sense heritability was as- test the simulation parameters from Jia and Jannink (2012)

308
Genomic prediction can be seen as a high-dimension regres-  For QTL detection on simulated data, the methods were com-

458
Genomic prediction and QTL detection 5 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted October 28, 2020. ; https://doi.org/10.1101/2020.10.26.355420 doi: bioRxiv preprint For experimental data, the whole nested cross-validation process was repeated 10 times (r=10), whereas for simulated  (2003)). Any marker 497 selected at +/-2 cM of a simulated QTL was counted as a True 498 Positive.

499
For methods with two tuning parameters, one parameter was 500 kept constant (α at 0.7 for EN and EN.mFDR, and λ 2 at 10e-8 501 for SPRING). We tested several values of α for EN but it did not QTLs.

509
For MTV_LASSO and SPRING, we split traits into three 510 groups as described above, for computational reasons (for 511 SPRING) and to test whether such splitting gave more reliable 512 QTLs (for MTV_LASSO). The parameters of penalized methods 513 were tuned by cross-validation, with MSE as the cost function. 514 We compared predictor selection between methods in terms 515 of the number of common selected markers per trait, i.e. the 516 intersection between markers selected by penalized methods 517 (focusing on LASSO and EN) and markers inside confidence in-518 tervals found by interval mapping methods (focusing on MIM).

519
Then all markers in high LD with those selected were considered 520 as selected too. The threshold was defined as the 95% quantile 521 of LD value distribution, for all pairs of markers belonging to 522 the same chromosome ( Figure S5), which gave a LD threshold 523 of 0.84. 524 We deemed selected markers as highly reliable if they were 525 either i) selected by at least five methods, whatever the meth-526 ods, ii) or selected by both EN.mFDR and MIM (see Results).

527
Then, we defined a highly reliable QTL as the interval of +/-3 528 cM around each highly reliable marker (Price 2006;Viana et al. 529 2016b), as predicted by loess fitting of genetic positions to physi- Brault et al.
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted October 28, 2020.

579
Genetic mapping 580 We constructed a saturated consensus genetic map with 3,961  were not significant ( Figure S9). On the opposite, prediction pared to RR and spring.reg never performed better than RR.

635
Since these results were unexpected, we also compared pre-636 diction accuracy of the above methods with the simulated data 637 published by Jia and Jannink (2012). We obtained very similar 638 differences among methods as with our simulated data, even 639 though prediction accuracy was higher in all cases ( Figure S10).

641
We compared the main methods mentioned above (except . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted October 28, 2020.     Genomic prediction and QTL detection 9 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted October 28, 2020.

Figure 4
Mean genomic predictive ability (Pearson's correlation between genotypic BLUPs and their predicted values), obtained by cross-validation for 10 methods applied to 14 traits related to water deficit and GBS gene-dose data, within a grapevine bi-parental population. Broad-sense heritability values are reported for each trait (y-position of the number corresponds to heritability estimate). Traits are ordered by decreasing heritability. For each trait, predictive ability is averaged over 10 cross-validation replicates x 5 cross-validation folds).  Figure S18).

741
The number of markers selected by MIM, LASSO and EN was 742 905, 1009 and 1550, respectively (Table S19) Table S19. A 764 visualization of these results is given in Figure 6 for night-time 765 transpiration under water deficit ( TrS_night.WD) and in Figure   766 S20 for all traits.

767
Most of the time, more markers were selected for traits under 768 water deficit than for traits in well-watered conditions, and they 769 were most often selected by several methods. We showed that 770 penalized methods tend to select the same markers, not only 771 close ones; for example, for TrS_night.WD on chromosome 4, the 772 same marker (at physical position 21,079,664 bp) was selected 773 by seven methods (Figure 6). 774 We considered markers selected by both MIM and EN.mFDR 775 as highly reliable ones for three reasons: 1) markers selected by 776 both MIM and EN were considered as reliable ones (see above); 777 2) simulations showed that MIM and mFDR methods led to 778 a very low FPR; 3) these methods belong to different method 779 classes (interval mapping vs penalized regression). We also 780 considered as highly reliable the markers selected by at least five 781 methods. These criteria resulted in a set of 59 highly reliable 782 selected markers, which were converted to genetic intervals of ± 783 3 cM around each selected marker. Overlapping intervals per 784 trait were merged, resulting in 25 highly reliable QTLs.

785
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted October 28, 2020.  . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted October 28, 2020. ;https://doi.org/10.1101/2020 doi: bioRxiv preprint   Table S25 for the details of this classification. When an integrated function at the organ or plant level was explicitly quoted in the gene annotation, genes were classified on this basis. When no integrated function was explicitly quoted, they were classified based on their cellular or molecular function. In both cases, functions were then classified as "Related" if related to the traits of interest in this QTL, or "Unrelated" if not.
data, and re-analyzed grapevine phenotypes obtained under 850 semi-controlled conditions. In particular, we showed that penal- and selection in most cases with highly correlated predictors. In 871 agreement with these results, we showed that EN performed 872 well for both prediction and selection on our simulated data, 873 and that multivariate EN performed the best for prediction on 874 the grapevine experimental data. 875 We also compared SPRING that can explicitly make use of a 876 genetic map. We observed that SPRING had a larger increase 877 in predictive ability from SSR to SNP design matrix than other 878 methods ( Figure S4). This was probably due to the fact that 879 SPRING uses LD pattern for prediction, this pattern being better 880 captured with a dense genetic map. However, SPRING showed 881 no systematic advantage over other penalized methods for pre-882 diction with the dense SNP map (Figures 2, 4).  (Figures 2 and S6).When different heritability 907 values were simulated for the two traits, we observed a slight 908 superiority of MTV_LASSO (resp. MTV_EN) over LASSO (resp. 909 EN) only in the "same" and "major" configuration (with both 910 traits sharing the same two QTLs) for the trait with small heri-911 tability ( Figure S9). and Jannink (2012) data allowed us to display a higher differ-928 ence between univariate and multivariate LASSO than with our 929 simulated data.

930
Unexpectedly, when reanalyzing the data simulated by Jia 931 and Jannink (2012), we obtained lower prediction accuracy with 932 our MTV_LASSO ( Figure S11) than they did with their multi-933 variate BayesA (their Figure 1A). A similar result in a univariate perimental data (Figure 4). However, we cannot exclude that 1104 this might be due to a poor optimization of Extreme Gradient to correctly fit the model.
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted October 28, 2020. ; https://doi.org/10.1101/2020.10.26.355420 doi: bioRxiv preprint