Preselection of QTL markers enhances accuracy of genomic selection in Norway spruce

Genomic prediction (GP) or genomic selection is a method to predict the accumulative effect of all quantitative trait loci (QTLs) in a population by estimating the realized genomic relationships between the individuals and by capturing the linkage disequilibrium between markers and QTLs. Thus, marker preselection is considered a promising method to capture Mendelian segregation effects. Using QTLs detected in a genome-wide association study (GWAS) may improve GP. Here, we performed GWAS and GP in a population with 904 clones from 32 full-sib families using a newly developed 50 k SNP Norway spruce array. Through GWAS we identified 41 SNPs associated with budburst stage (BB) and the largest effect association explained 5.1% of the phenotypic variation (PVE). For the other five traits such as growth and wood quality traits, only 2 – 13 associations were observed and the PVE of the strongest effects ranged from 1.2% to 2.0%. GP using approximately 100 preselected SNPs, based on the smallest p-values from GWAS showed the greatest predictive ability (PA) for the trait BB. For the other traits, a preselection of 2000–4000 SNPs, was found to offer the best model fit according to the Akaike information criterion being minimized. But PA-magnitudes from GP using such selections were still similar to that of GP using all markers. Analyses on both real-life and simulated data also showed that the inclusion of a large QTL SNP in the model as a fixed effect could improve PA and accuracy of GP provided that the PVE of the QTL was ≥ 2.5%. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-023-09250-3.


Introduction
Genomic prediction (GP) or genomic selection using genome-wide dense markers has been widely adopted in animal breeding and extensively studied in crops [1] and tree plant species [2] in the last decade. GP assumes that individual quantitative trait loci (QTLs) are linked with at least one DNA marker. Consequently, linkage disequilibrium (LD) between QTLs and markers plays an important role in increasing genomic prediction efficiency [3]. It has also been suggested that GP model training encompassing multiple generational populations (tracing LD) also will increase GP efficiency. Several previous GP results have shown that markers in stronger LD, or even coinciding with causative mutations, implies higher accuracy of GP. This is because there is no need to trace the causative mutations with LD markers when the causative mutations are among the genotypes [4]. Therefore, the inclusion of markers tightly associated with a few large-effect QTLs, detected by genome-wide association (GWAS) or validated by gene transformation, could be incorporated into the GP model development [5]. Including such large-effect SNPs as fixed effects in GP modelling is also considered as an ideal approach to increasing GP efficiency, which has been verified by several studies in crops, including simulations [6][7][8]. GWAS is considered a powerful approach to dissecting the genetic architecture of different traits in animals and plants [9,10]. Usually, a locus accounting for 1-15% of phenotypic variation is detected by GWAS if population sizes are large enough, from a few hundred to a few thousand in plants [11], and a few thousand to a few million in humans [12].
Several studies in trees have shown, that GP using a few thousand randomly selected SNPs may capture most of the variation and achieve a similar GP accuracy as using all available markers [13][14][15][16]. It has also been reported that marker preselection could slightly-to-moderately improve GP accuracy in tree species [17][18][19]. In theory it is desirable that GP captures all QTL effects by LD between markers and causative loci. But provided that one or two of several strongly linked markers for each QTL is selected, the model should capture the QTL effects almost in their entirety.Recently, following the development of high throughput genotyping techniques and tools, more than several hundred thousand markers have been commonly and easily produced by several genotyping platforms, such as SNP array, exome capture, and genotyping-by-sequencing (GBS). Thus, marker preselection could become a very useful and common pre-step for GP.
Norway spruce (Picea abies (L.) Karst.) is one of the most important economic species in Europe, especially in the Nordic countries and Northwestern Russia [20]. The current breeding program of Norway spruce, similar to other conifer species, mainly focuses on the use of additive genetic effects (i.e. breeding values). However, the non-additive genetic effects are considered important if a clonal deployment is considered as a deployment strategy in the future [21][22][23]. Recent several studies on the cost, benefit and genetic diversity with the deployment of clonal forestry for Norway spruce indicated that a considerable productivity increase of planted forests could be achieved with an acceptable genetic diversity [23][24][25][26][27][28][29][30].
In Norway spruce breeding, traditional breeding values, dominance, and epistatic genetic effects were predicted for seedling or clonally propagated progenies, based on the theoretical expectation using pedigree data [23]. When genome-wide dense markers are available, the estimates of genetic parameters for additive, dominance, and epistatic effects may be more accurate since the real genomic relationships were captured by alleles [31,32]. Several studies in tree species have demonstrated an increase in the accuracy of genetic parameters estimates using genomic models [33][34][35]. The accurate estimation of non-additive effects, as well as additive effects, should therefore be an important objective to increase genetic gains and improve the efficiency of the Norway spruce tree breeding program.
In this study, we explored the use of detected GWAS QTLs by including the most closely associated SNPs in GP for tree breeding value prediction through empirical experiments and simulations. The detailed aims of this study were to: 1) dissect the total genetic component into additive and non-additive variances using genomic-based relationship matrix models and compare with the corresponding dissection using traditional pedigree-based relationship models; 2) test the efficiency using different number of SNPs, different number of clones per family with informative marker preselection; 3) investigate the impact of training population size and relatedness between training and validation populations on GP efficiency; and 4) test whether GP could be improved by including the most significant GWAS-marker as a fixed effect in the GP model.

Genetic parameter comparisons between models
We compared the goodness of fit of four models which each could utilize traditional pedigree data (PBLUP) or genomic data (GBLUP). In brief, the models included one simple additive model (PBLUP-A, GBLUP-A), one additive model where a residual genotypic (non-additive clonal) effect was included (PBLUP-AR, GBLUP-AR), one model where a dominance term was included on top of the additive and residual clonal terms (PBLUP-ADR, GBLUP-ADR) and finally a model where one specific epistatic term also was added (PBLUP-ADR-xx, GBLUP-ADR-xx). Please see Material and Methods section Eqs. (1-4), Table 1 and Table S1 for more detailed information. For all traits, the GBLUP-AR model had the smallest Akaike information criterion (AIC) value, except for frost damage (FD) with a zero non-additive variance (i.e. GBLUP-A). This indicates that the fitting of GBLUP-AR was generally better than all other models, both pedigree-based and genomic-based, and implies that the genetic parameters of the GBPLUP-AR models should be better estimated than for the other models. Based on the AIC, we did not see that models with a dominance term (PBLUP-ADR and GBLUP-ADR) or first-order epistatic effect terms (PBLUP-ADR-xx and GBLUP-ADRxx) showed any better fit than the simpler GBLUP-AR or even GBLUP-A (for FD) models.

Estimates of variance components and heritability
We found that the additive genetic variance under PBLUP-AR and GBLUP-AR decreased compared to that under PBLUP-A and GBLUP-A for HT6, HT12, DBH, and BB, respectively (Table 1). Given that the PBLUP-AR and GBLUP-AR exhibited the best fit for all these traits, the previously observed patterns indicate that the additive genetic variance in the PBLUP-A and GBLUP-A may be inflated due to the inadvertent inclusion of non-additive effects within σ 2 a . For the growth traits (HT6, HT12 and DBH), the estimates of the non-specific non-additive genetic variance ( σ 2 r ) were substantial in size relative to the additive genetic variance estimates ( σ 2 r > 0.2σ 2 a ) under the model with lowest AIC (GBLUP-AR). For BB and PILO the non-additive variance estimates were relatively small in comparison to the additive variance ( σ 2 r < 0.2σ 2 a ). Finally, FD did not exhibit any non-additive effects regardless of whether the pedigree-or genome-based matrix models were used. Taken together, we conclude that non-additive effects are important for all growth traits, whereas nonadditive effects had a limited impact on non-growth traits BB and PILO and were absent in FD. Estimates of dominance variance were generally small and non-significant and most of the first-order epistatic interactions under PBLUP-ADR-xx and GBLUP-ADR-xx were in boundary close to zero based on Equation [4] (Table S1). Thus, we do not discuss the result of the PBLUP-ADR, GBLUP-ADR, PBLUP-ADR-xx and GBLUP-ADR-xx models any further.

Summary of Norway spruce 50 K SNP array, LD decay, and association mapping
For the Piab50K SNP array, 41,236 of the total 47,445 SNPs were mapped and evenly distributed across the 12  (Table S2). The physical extent of LD (r 2 > 0.2) within each chromosome varied from 33.7 kb in chromosome 2 to 54.6 kb in chromosome 9, with an average of 42.9 kb for the whole genome based on the SNP array in the studied full-sib family population (Fig. S1). The family clusters were clearly separated by the first two principal components of a marker-based principal components analysis (PCA) (Fig. S2). In total, GWAS identified 41, 11, 2, 4, 4, 11, and 13 SNPs as having a significant effect on BB, DBH, FD, HT12, HT6, and PILO, respectively, under a false discovery rate (FDR) of 0.05 (Fig. 1b). The largest effect sizes for trait-associated SNPs explained 5.1, 1.6, 1.7, 2.0, 1.2, and 1.4% of the phenotypic variation of clonal means in BB, DBH, FD, HT12, HT6, and PILO, respectively (Table  S3). There were six highly significant SNP-trait associations with considerable PVE (> 2.5%) for BB whereas such observations were not made for the other traits.

Predictive ability comparisons between PBLUP and GBLUP
We initially estimated predictive abilities (PAs) by a ten times tenfold cross-validations where PBLUP and GBLUP models were trained by the randomly selected portions of the population where phenotypic clonal means were made available whereas the validation was made on the other portion of the population where phenotypic data was kept hidden. The PA-estimates were then compared to the square root of the clone mean narrow-sense heritability (h c , Table 2 and Table S4) because the latter can be interpreted as the theoretical PA and accuracy for mass selection entirely based on phenotypic clonal means. In agreement with previous comparisons between PBLUP and GBLUP in terms of AIC, we observed that PA-estimates of GBLUP (0.23 -0.67) were consistently higher than corresponding PA-estimates for PBLUP-C (0.21 -0.61). The increases in PA under GBLUP were the greatest for HT6 and BB. For example, PA of GBLUP for HT6 increased by 15.8% when compared to PBLUP. However, the PA-estimates for GBLUP in the tenfold cross-validation scheme were in turn consistently and considerably lower in comparison to the corresponding theoretical PAs (i.e. square root of heritability, h c in the range 0.37 -0.89).

The impact of the number of SNPs included in the G-matrix
For all traits, the overall PA of the GBLUP model increased from 25 randomly selected SNPs to ca. 4000 SNPs (Fig. 2a) and plateaued when even more SNPs were selected. Similarly, the standardized AIC values of the GP model decreased from the model with 25 SNPs to ca. 4000 SNPs and then the AIC stabilized when even more SNPs were selected (Fig. 2b). For example, for BB, PA under the GBLUP-model increased from 0.36 using 25 SNPs to 0.63 using 4000 SNPs in the estimation of a genome-based relationship matrix. When marker preselection was based on GWAS-analyses, the resulting prediction models showed two types of trends for the traits. First, for BB, the genomic prediction model (GBLUP-S) with 100-200 preselected SNPs with the smallest p-values obtained a higher PA than that  Benjamini and Hochberg (1995). SNPs not mapped into the Norway spruce genome v2 (In preparation) were grouped into a region assumed as chromosome 13 in this study. BB, bud burst stage; DBH, diameter at breast height; FD, frost damage; HT12, tree height at field tree age 12; HT6, tree height at field age six; PILO, Pilodyn penetration using all other numbers of SNPs (Fig. 2a). To check if GP with a marker preselection by GWAS captured more Mendelian segregation effects, we also calculated the correlation between estimated breeding value deviations from family mean of EBVs and within-family phenotypic variation (clone mean-family mean). We found that GBLUP-S using 100-200 informative markers captured more of the phenotypic variation of BB and exhibited the lowest AIC value (Fig. 2c). This suggests that using all available SNPs for a trait that was strongly associated by GWAS to a considerable but finite number of markers, may introduce misleading noise in the prediction models.
Second, for the other five traits (DBH, FD, HT12, HT6, and PILO), we found that the PA of GP using a preselected set of SNPs with the smallest p-values did not outperform random marker selection unless the numbers of SNPs selected were less than a few hundred. Furthermore, for these traits preselecting more markers beyond a few hundred still increased the PA with the highest value being reached when all markers were used (Fig. 2a). For traits DBH, FD, HT12, HT6, and PILO, within-family PA for marker selection based on low p-values was higher than for random selection when smaller numbers of markers were selected all ranging from 25 to 2000 SNPs. But random selections of markers showed equal or higher within-family PA than GWAS-based preselection for these traits when more than 10,000 SNPs were selected. The trends of PA when selecting more and more markers for DBH, FD, HT12, HT6, and PILO were not obvious where a flat within-family PA optimum for GWAS-based preselection was observed for HT12 and DBH (at 1000 and 200 markers respectively) whereas for the other traits more markers resulted in higher withinfamily PA. AIC values for all traits except BB showed a similar trend (Fig. 2c) where the lowest AIC value occurred at 2000-4000 preselected markers.

Predictive ability comparisons between advanced GBLUP models
As indicated in Fig. 2, only BB showed appreciable increases in overall PA from 0.61 for GBLUP to 0.72 for GBLUP-S when only 100 high-significance SNPs in GWAS were included in the G a -matrix (Table 2). For all other traits the GBLUP-S model including 2000 highsignificance SNPs showed similarly or lower PA than the conventional GBLUP including all SNPs. In addition, we attempted the fitting of a single SNP showing the smallest p-value (the highest significance) in GWAS as an additional fixed regression effect in the model (GBLUP-F). For HT12, BB, and FD, the GBLUP-F model produced a slightly higher PA than GBLUP (by 0.02 -0.03) but for DBH, HT6, and PILO, GBLUP-F and GBLUP showed similar values. Furthermore, we calculated an approximation of cross-validation accuracy in the absence of phenotypic data as the PA of PBLUP and GBLUP divided by h c and we found that the overall accuracy of GP for all traits ranged from 0.58 for HT6 to 0.81 for BB.

Simulations of including a large effect SNP as fixed in the GBLUP
To verify whether including a major gene locus as a fixed effect in the GBLUP model (GBLUP-F) would increase PA of GP, we also performed 10 repeats of tenfold cross-validations on simulated data where single SNPs exerting a range of PVE were included (Table 3). Simulation data is easier to interpret than real-life data since it offers the possibility to robustly estimate GP accuracies merely by calculating the correlations between predicted and true breeding values. Results showed that the overall accuracies of the GBLUP-F model were higher than that of GBLUP when provided a major-effect SNP with PVE > = 2.5% (0.66-0.71 and 0.60-0.63 for GBLUP-F and GBLUP Table 2 Overall predictive ability (PA) for different types of pedigree-and genome-based prediction models following a tenfold crossvalidation procedure The prediction models showing the highest PA for a trait have their PA-estimated highlighted in italic bold. The h c is the square root of the clonal mean narrow sense heritability ( h 2 e ) based on PBLUP-AR model. PBLUP-C is the traditional pedigree-based best linear unbiased prediction (BLUP) including marker-based pedigree correction; GBLUP is the genomic-based BLUP; GBLUP-S is genomic-based BLUP and with marker-preselection based on the smallest p-values from genome-wide association analyses (GWAS) for each training population. In this table, the 100 smallest p-value SNPs were preselected for the budburst stage (BB) and 2000 SNPs were preselected for the other traits. GBLUP-F is a genomic-based BLUP model plus the SNP with the smallest p-value fitted as an additional fixed regression effect. This SNP was also selected based on GWAS for each training population. respectively). Following a similar trend, the overall PA of the GBLUP-F model increased from 0.31 to 0.35 for a fixed SNP explaining 0% to 5% of the phenotypic variation whereas no increasing trend was observed for the model without the major QTN included.
The within-family accuracy of the GBLUP-F model increased from a very low level (0.10) when the fitted fixed-effect SNP explained 0% of the variation (i.e. fitting a false-positive QTL) to 0.42 for a SNP with a PVE at 5% (true major gene, Table 3). In contrast, for the model where the assumed major SNP was not fitted as a fixed effect, within-family accuracies remained at very low levels (0.09-0.11) regardless of the PVE of the major SNP. The trends of within-family PA were similar as those for accuracy but were all lower due to the environmental noise that always influences PA-estimates. The above results indicate that including a large-effect SNP (PVE > = 2.5%) as a fixed effect in the GBLUP model may improve the accuracy and PA, especially with respect to predictions within family.

Number of clones per family
To test the effect of the number of clones per family available to the training dataset on the PA using PBLUP and GBLUP (Fig. 3), we sampled 5, 10, 15, 20, 25, or 30 clones from each of the ten largest families as a training data set and the rest of clones in those families as a validation set. We found that the PA of both PBLUP and GBLUP consistently increased from 5 clones per family to 30 clones per family for all traits except for PBLUP for BB where an optimum was reached at 20 offspring clones per family. Based on the trends and with GP in mind (GBLUP), Table 3 Predictive abilities and accuracy estimates with genomic prediction model with and without adding an SNP with the largest effect size as a fixed effect (GBLUP and GBLUP-F) based on simulated data using a trait with a heritability of 0.25 and evaluated using tenfold cross-validations PVE is the percentage of phenotypic variance explained by a large effect size SNP. QTN represents the quantitative trait nucleotide with a large effect size whereand + signifies the absence (GBLUP) and presence (GBLUP-F) respectively, of this QTN in the genomic prediction model as a fixed effect. The presumed major-effect locus/QTN is fixed for each training population. The value in parenthesis indicates the standard deviation across replicate simulations

Number of clones per family and marker preselection by GWAS also affect the within-family variation
To further test whether combining GWAS p-value based marker preselection and the number of clones per family would affect the predictive ability (PA) within families, we performed tenfold cross-validations using GBLUP within the largest ten families only (Fig. 4). We found that the PA within the ten largest families using GBLUP with GWAS-based preselection of a number of markers produced similar trends (Fig. 4a) as the corresponding PA values estimated for the population as a whole (Fig. 2a). In similarity to the results of the whole population, GBLUP of BB with 100-200 preselected SNPs produced higher predictive ability and lower AIC than using a lesser or greater number of SNPs for G a -matrix calculation ( Fig. 4a and b). For the other traits, 2000-4000 preselected SNPs produced the lowest AIC in agreement with the whole-population analysis. The trends for within-family PA, although less obvious, indicated that some sort of SNP preselection resulted in higher PA than uncritically using all SNPs in the model. These results could imply that more within-family variation linked to Mendelian segregation could be captured by preselecting influential markers for GP and that the capture of such segregation variation is likely easier in a population where families are fewer and larger thus offering a higher average relationship. Based on the marker preselection results for within-family prediction (Fig. 4c), we also found that PA increased quickly from 5 clones per family to 30 clones per family for all traits, except for FD, which showed a slight decrease for within family prediction after 25 clones per family. For example, PA within-family variation for BB increased from 0.07 at In practical breeding, GP may be performed for a new full-sib family which does not offer close relationships with existing families in a training dataset. Thus, we also evaluated one special type of cross-validation in which the clonal mean data for each of 32 families were in turn held out from the GBLUP training dataset, while the family for which clonal mean data was missing was used for the prediction validation (across-family validation). As previously, we performed the GBLUP with 1) all markers for G a -matrix calculation (GBLUP), 2) 100 preselected SNPs for BB and 2000 preselected SNPs for the rest of the six traits based on significant SNP-trait associations (GBLUP-S), and 3) all markers for the G a -matrix plus the marker with the highest GWAS significance fitted as a fixed regression effect in the model (GBLUP-F). We found that marker preselection increased PAs for some traits (HT12, PILO, BB, and FD), but not for others (HT6 and DBH). The PA-estimates across families were very low for most traits (0.03 to 0.11) but for BB the GBLUP PA was 0.19, increased to 0.24 when fitting the most powerful SNP as a fixed effect (GBLUP-F), and increased further to 0.25 when using GBLUP-S (Table 4).

Genomic-based BLUP model could enhance the accuracy of the estimated additive and non-additive genetic variances
In tree or conifer breeding programs, a control-pollinated clonal test provides an opportunity to simultaneously estimate the additive, non-additive, dominance, and epistatic variances using pedigree. However, most heritability estimates in many traditional tree breeding programs were based on open-pollinated or control-pollinated progeny trials without vegetative propagation. In this context, such estimates of additive genetic variation may be biased due to the difficulty of separating the additive genetic variance from parts of the dominance and epistatic variances [36]. Using vegetatively propagated material in combination with a genome-based model, the additive and dominance variances may still show bias if the model does not include a residual genetic effect [32,35]. Potentially, genome-based models could capture and discriminate between the different sources of the nonadditive genetic variance such as dominant effects within a locus and epistatic interaction effects among loci. Among different GP models, GBLUP-AR showed the smallest AIC values for most of the traits, indicating that non-additive genetic effects are significant for all traits, with frost damage being the only exception. Also, the fact that GBLUP-AR model AIC values for these traits were systematically lower than the corresponding PBLUP-AR AIC, implies that GP models perform better with respect to separating additive and non-additive variances from each other than pedigree-based models. For example, PILO showed non-significant and negligible non-additive effects under PBLUP-AR but showed nonetheless significant non-additive effects under the GBLUP-AR model, indicating that the GBLUP-AR model with a realized relationship matrix could better capture and separate the additive genetic variation thus improving genetic parameter estimates.

Significant marker-trait associations and genomic prediction
Recently, several studies on tree species reported that selecting markers with particular influence over a trait could improve PA [16]. Tan and Ingvarsson (2022) [19] reported that a careful 1% preselection of markers could improve the estimate of heritability and GP in a Eucalyptus population. For Pinus contorta Douglas ex Loudon var. latifolia, Cappa et al. (2022) [17] reported that selecting informative markers, in particular markers capturing ancestry/population structure, can improve PA.
The recently developed 50 k SNP array for Norway spruce included several QTLs per trait detected in our previous GWAS [37][38][39][40]. In this study, GWAS identified 44 associated SNPs for the budburst stage and one Table 4 Predictive ability for clone mean phenotypic variation and their standard errors in parenthesis for different models where the validation procedure was performed across different families In this scenario (GBLUP), each of the 32 full-sib families were separately cross-validated using the phenotypic data of the remaining 31 families as training sets. In total, 32 cross-validations were repeated to calculate the predictive ability of phenotypic clonal means. GBLUP-All is a model with all SNPs; GBLUP-S is a model with  Table S3), having the second highest PVE (> 4%) of all SNP-BB associations, was located within 400 base pairs from a QTL for BB previously detected by GWAS in a different population of ca. 4000 individuals [40]. We also observed two to 13 SNPs significantly associated to the other five traits, offering the opportunity to test if including the most significant SNP in GP model improves PA.

Marker preselection and inclusion of a large effect QTL as a fixed effect could enhance GP predictive ability
For BB, we found that a G a -matrix built from 100 preselected markers resulted in GBLUP-models having lower AIC values and being better at predicting the genetic value in the absence of phenotypic data than did a G a -matrix model using all available markers. Such preselected SNPs were also observed to capture a considerable amount of within-family variation/Mendelian segregation effects. However, for other more polygenic traits, genomic models using ca. 4000 preselected markers showed a comparable PA compared to the model using all markers, even though such models exhibited lower AIC values. This indicates that the preselection of influential markers is more likely to be successful when applied to a limited set of markers showing highly significant associations to a trait where a relatively limited number of trait-associated SNPs are indicated to exhibit considerable individual effects (PVE > 1%).
In this study, we also investigated a GP model for BB where a large-size QTL with a PVE of ca. 5% was explicitly included as a fixed regression effect and we observed a 4.4% improvement in overall PA (Table 2) in comparison to a GP model without such a modification. Such enhancement was also observed in several empirical crop studies [41,42] and a simulation study conducted by Bernardo (2014) [6]. Our finite-locus simulations also showed a similar result with 13% and 18% of improvement, in terms of PA and accuracy respectively, for the model including a locus of large effect size (PVE at 5%) as a fixed effect compared with the model without this model term. The model improvement was particularly notable when the objective was the capture of within-family Mendelian segregation variation (Table 3). However, if the QTL PVE was less than ca. 1.25%, the simulated data analyses did not indicate any such advantage for the model including the locus as a fixed effect. Based on more than dozens of GWAS results in tree species [11], SNPs detected with a PVE > 1.25% are not uncommon, indicating that GP appropriately utilizing a large-size QTL should be very useful to improve PA and accuracy.

Family size matters for the efficiency of genomic prediction
Model training using higher numbers of clones per family, is usually expected to capture more Mendelian segregation effects [43] and improves the PA and accuracy of genetic parameter estimates [44]. In this study, we observed that increasing the family size was important to improve the PA, both with respect to overall phenotypic prediction (Fig. 2) but also to within-family prediction (Fig. 4). This was especially obvious when the family size was small (less than 15 clones per family).

Relationship between training and validation datasets highly important for genomic prediction
Forward-selection tree breeding usually entails the selection of a few elite candidates within a progeny test population (F1) and the candidates are in turn crossed to produce a new batch of progenies (F2). Thus, when using existing F1 as a training dataset to predict F2, the average relationship between F1 and F2 will be lower compared with relationships within the same F1 generation as was used for tenfold random cross-validations in this study. We therefore also investigated the situation where training and validation datasets contained individuals from separate families, and we produced a 32-fold crossvalidation scheme by removing phenotypic data for one family at a time. The relationships between the validation family and the training dataset were thus restricted to the level of half-sibs or weaker. Thus, the PA across families was considerably lower than for conventional random cross-validation (Table 4), which was a result similar to that in Pinus taeda L. [35]. However, it is notable that BB also in this case offered an appreciable PA estimate (0.19) and this estimate was further improved if a marker preselection was performed for calculating the G a -matrix of the model (GBLUP-S, 0.25) or by fitting the most significant marker as a fixed effect in the model (GBLUP-F, 0.24).

Marker density and LD affects predictive ability of phenotypic variation and within-family variation
Marker density (i.e. the number of markers) is usually considered as one of the most important factors affecting GP performance [45]. However, in forest tree breeding, the capture of the expected pedigree relationships (e.g. 0.25 among half-sib siblings), only requires a few thousand SNPs. Literature indicates that such a number of markers would also be enough to achieve an overall predictive ability similar to models utilizing all markers in the same batch of markers [13,17,46]. In such situations, a few thousand SNPs preselected based on GWAS or other prior information could instead improve the PA, especially for traits where several large-effect QTNs have been found [17,19,40]. In theory, coefficients of the pedigree relationship matrix describe additive genetic relationships between individuals at quantitative traits loci [47], but in reality, it is not obvious to what extent the genomic relationship matrix explains a genetic covariance matrix between individuals for QTLs, especially for a targeted trait.
In animal breeding, for example in cattle breeding, using imputed whole-genome markers appeared to capture similar phenotypic variation as using a ~ 60 k SNP array due to strong LDs between markers and QTL. But such observations were made only within a few cattle breeds for a species with a moderate genome size, ca. 3.1G [5]. Conifers usually have a much larger genome size, such as Norway spruce with 20G of the total genomic content [48]. And also, LD decay in tree-breeding populations is much faster than in cattle-breeding populations. In this study, the extent of LD (r 2 ≥ 0.2) was observed to be 42.9 kb when based on the 50 k SNP array used in this population (Fig. S1). Based on such an extent of LD between markers and QTLs, the 50 k SNP array only covered ca. 25% of genome size. This could be one of the reasons why the PA of GP did not reach the standard value of the square root of additive clone mean heritability (Table 2) and why within-family and across-family PA was low for most of the studied traits (Table 4 and Fig. 4). In order to capture more Mendelian segregations between QTLs and makers, we would then need more markers for a successful marker preselection. For BB, as an example, the model with 100 pre-selected markers based on GWAS may have captured LD between markers and QTLs, and indeed captured a considerable amount of within-family variation in agreement with a few studies on other tree species [17,19,49].

Conclusion
Genomic selection with a marker preselection is considered an efficient approach in animal and tree breeding. For the budburst stage (BB), whose character within this study appeared to be oligogenic, a preselection of approximately 100 SNPs based on the smallest p-values from GWAS, showed the highest predictive ability (PA). But for the other traits, approximately 2000-4000 preselected SNPs, indicated by the smallest Akaike information criterion to offer the best model fit, still resulted in PA being similar to that of GP models using all markers. Analyses on both real-life and simulated data also showed that the inclusion of a large QTL SNP in the model as a fixed effect could improve PA and the accuracy of GP provided that the PVE of the QTL was ≥ 2.5%. Currently, most of the published marker resources designed for genomic selection in tree species, such as exome capture, genotyping-by-sequencing (GBS), and commonly designed SNP arrays did not consider the potential redundancy of some of the markers. Therefore, within such marker panels, many markers could be in strong LD with each other, but still not being closely associated to QTLs. Therefore, using all available markers in a genomic model will not necessarily improve genomic prediction, and may even decrease the prediction power, such as was observed for the studied trait BB. Thus, we encourage performing a marker preselection step for genomic prediction, especially when whole genome sequencing data or whole genome-imputed markers data becomes available. Meanwhile, the inclusion of a large QTL SNP in the model as a fixed effect is also recommended in genomic selection.

Plant materials
A Norway spruce breeding population using 32 controlpollinated families from 49 parents was established in 2007 at four different field sites. A total of 1430 unique clones were derived from the 32 families with an average of about 45 clones per family and about three ramets per genotype were originally planted in each site. The detailed descriptions of the four sites in this breeding population were presented in the published paper [21]. Generally, the field site series were established using a randomized incomplete block design with single-tree plots. Meanwhile, 98% of clones were replicated among the four sites. In this study, we selected all available clones from ten families (ca. 50 clones per family) and 20 clones from each of the remaining 22 families (a total of 904 clones) from a single site S1389 (Rössjöholm) for genotyping (Table S5).

Phenotyping
Tree height was measured at field age six (HT6) at all four sites and at twelve years (HT12) at one of the sites. Diameter at breast height (DBH) was measured at field age twelve at all four sites. Budburst stage (BB) was scored at field age six at three sites based on eight categories [50]. Pilodyn penetration (PILO) as a proxy of wood density was measured at a single site by Pilodyn 6 J Forest (PRO-CEQ, Zurich, Switzerland) at field age twelve. Frost damage (FD) was quantified at a specific single site because that site was exposed to a severe frost event at field age six. FD was scored as a categorical variable from zero (without frost damage) to three (the most severe damage). The detailed descriptions of traits measured in each of the four sites are shown in Table S6.

Genotyping
Newly fresh needles were sampled from 904 clones in the spring of 2018. Total genomic DNA was extracted using the Qiagen plant DNA extraction protocol with DNA quantification performed using the Qubit ® ds DNA Broad Range Assay Kit (Qiagen, Oregon, USA). Genotypic data were generated using the Norway spruce Piab50K SNP array chip [51]. Genotype calling of the 50 k Axiom array was performed as the description in the paper [51]. Here, missing SNPs were imputed by Beagle v4.0 [52].

Pedigree correction
Since a couple of clear discrepancies between the additive relationship and the genomic relationship matrices (A and G a , respectively) were detected for some individuals, we performed a pedigree correction for this population based on A and a heatmap of G a . The number of parents increased from 49 based on the documented pedigree to 55 based on G a and the number of families also increased from the original 32 to 56 (Table S5). Finally, the number of clones per family after correction varied from 1 to 56.

Spatial analysis
Spatial analysis based on a two-dimensional separable autoregressive (AR1) model was used to fit the row and column directions for phenotypic data from each site using ASReml v4.1 (Gilmour et al., 2015). Block effects were estimated simultaneously. Adjusted data, where all significant block and spatial effects were removed, were used for downstream analyses.

Variance component and heritability estimates
Four univariate models were used to estimate variance components for each trait based on pedigree-based best linear unbiased prediction (PBLUP) and genomic-based best linear unbiased prediction (GBLUP) as following: where y is the vector of adjusted phenotypic observations of a single trait; β is the vector of fixed effects, including a grand mean and site effects; a and d are the vectors of random additive and dominance effects, respectively;as and ds are the vectors of random additiveby-site and dominance-by-site effects, respectively; e xx is one of e aa , e ad , ande dd , which are the vectors of random additive-by-additive, additive-by-dominance, and dominance-by-dominance epistatic effects, respectively; r is the vector of residual genotypic effects, referring to an (1) y = Xβ + Z 1 a + Z 2 as + ε (2) y = Xβ + Z 1 a + Z 2 as + Z 6 r + Z 7 rs + ε (3) y = X + Z 1 a + Z 2 as + Z 3 d + Z 4 ds + Z 6 r + Z 7 rs + (4) y = X + Z 1 a + Z 2 as + Z 3 d + Z 4 ds + Z 5 e xx +Z 6 r + Z 7 rs + un-dissectable combination of dominance and epistatic effects in Eq. (2), epistatic effects in Eq. (3), epistatic effects excluding e xx effects in Eq. (4); rs is the vector of residual genotypic-by-site effects; ε is the vector of random residual effects. X, Z 1 , Z 2 , Z 3 , Z 4 , Z 5 , Z 6 , and Z 7 are the incidence matrices for β, a, as , d, ds, e xx , r, and rs, respectively (detailed descriptions in Supplementary methods S1).
The pedigree-based additive (A) and dominance (D) relationship matrices were produced using the AGHmatix package [53]. The genomic-based additive ( G a ) and dominance ( G d ) relationship matrices were constructed based on imputed SNP data as described by [54] for G a and by [55] for G d using AGHmatrix package in R [53]. The relationship matrices due to the first-order epistatic interactions were computed using the Hadamard product (cell by cell multiplication, denoted #) and trace (tr) [56]. Due to the fact that several parents were represented only in a single controlled cross, a hypothetical unique rare allele present in such a parent (heterozygote) would only be passed to about half of its progeny. Thus, SNPs with a minor allele frequency (MAF) less than M/(2n) were filtered when calculating all the genomic matrices for GBLUP-models, where M is the harmonic number of clones per family and n is the total population size in each GP model. When applied to the whole population as in this case (n = 904), we filtered away SNPs with a MAF < 0.006. More detailed descriptions of the matrices are shown in Supplementary methods S2 whereas the estimates of genetic variance parameters are shown in Supplementary methods S3.

Association mapping
To check the additive genetic architecture of the six traits in the breeding population, we also performed GWAS based on clone mean values across-site for all traits using the multi-locus BLINK model [57] conducted in GAPIT V3.0 R Software package [58]. Huang et al. [57] considered that BLINK could simultaneously control the rates of false positives and false negatives even without adding principal components or a genomic relationship in the model. Thus, we used a genomic inflation factor (i.e. λ, which is defined as the median of the resulting χ 2 test statistics based on p-values from GWAS divided by the expected median of the χ 2 distribution [59]) to check if BLINK has controlled both errors without adding both principal components and relationship matrix in the model. Because the observed genomic inflation factors (IF) all fell within the range 1 ± 0.05 (see QQ-plots in Fig.  S3), we did not add any extra model terms to control a population structure and cryptic relationships. Also here, we filtered SNPs with MAF < 0.006 for the purpose of GWAS and thus, 46,482 SNPs were kept. The genomewide significance of associations was determined at an experiment-wise false discovery rate 0.05 according to Benjamini & Hochberg (1995) [60]. The percentage of phenotypic variance explained (PVE) for each significant association was obtained from results using a mixed linear model (MLM) conducted in GAPIT V3.0 R software package [58].

Linkage disequilibrium
Genome-wide analysis of linkage disequilibrium (LD) was conducted in the F1 full-sib progeny population. All SNPs were mapped onto the Norway spruce genome v2.0 (In preparation). Out of the 47,445 SNPs available from the Piab50K SNP array, 43,267 SNPs were successfully mapped and were evenly distributed across the 12 chromosomes (Fig. 1a). The remainder 4,178 SNPs could not be correctly mapped to any chromosome. Therefore they were instead grouped together in a created dummy "chromosome 13" (Table S2). LD values for pair-wise SNPs within each chromosome were calculated using VCFtools [61]. A non-linear model was used to fit an LD decay trend based on the Hill and Weir formula [62].

Cross-validation test
Due to the negligible effects of dominance, dominanceby-site, and also first-order epistatic effects for all traits (Table S1), clone means calculated across-site were used as phenotype values to perform cross-validations. Ten sets of tenfold cross-validations were performed. In summary, we employed a model as below: where y' is the vector (904, 1) of the clonal means acrosssite, β is the vector of fixed effects (grand mean and an eventual single-locus SNP effect), a is the vector of additive effects and ε is the vector of random residual effects. X and Z 1 are the matrices related to the β and a. The random additive effects ( a ) were assumed to follow a ∼ N 0, Aσ 2 a with σ 2 a being the additive variance in PBLUP-A whereas the matrix A will be replaced by G a in the GBLUP-A model.
The prediction ability (PA) was defined as the Pearson correlation between predicted breeding values (EBVs) and clonal means. Furthermore, we calculated the PA (5) y ′ = Xβ + Z 1 a + ε for within-family predictions as Pearson correlations between the EBVs deviation from EBV family mean and the clonal mean deviation from the family means used as benchmark validation values. An estimate of the selection accuracy was calculated by dividing PA with the square root of the pedigree-based narrow-sense clonal mean heritability (PA/h c ).

Testing the efficiency of genomic prediction Marker density and maker preselection
To test the impact of the number of SNPs on the overall and within-family PA of GBLUP, we performed GPs using 14 subsets of SNPs (25, 50, 100, 200, 500, 1 K, 2 K, 4 K, 8 K, 10 K, 20 K, 30 K, 40 K, and all SNPs) and using two different types of sampling strategies: 1) randomly selected SNP subsets and 2) SNP subsets selected based on the smallest p-values shown in the GWAS for additive effects using BLINK in the training population. We performed these steps in both the whole population with tenfold cross-validation replicated 10 times (n = 100) and in a subset population with 482 clones from the 10 largest families. The general model fit was evaluated by inspecting their Aikake Information Criterion values (AIC) and the subset sample of markers showing the smallest AIC-values were selected for further investigations (henceforth called GBLUP-S). In the subset population (n = 482), we filtered SNPs according to the previously mentioned principles thus filtering out all SNPs with MAF < 0.025. Consequently, 42,665 SNPs were kept for the GWAS used for marker preselection.

Different statistical models
For in depth investigations of the impact of marker preselection, we tested and compared the efficacy for four different modelling approaches: 1) PBLUP-C: the traditional pedigree-based BLUP including pedigree correction. 2) GBLUP: a genome-based BLUP with a relationship matrix (G a ) estimated from all markers. 3) GBLUP-S: a genome-based BLUP with a G a matrix estimated from a subset of preselected markers selected based on GWAS results generated for each training population separately. The number of preselected markers depended on the perceived genetic architecture of each trait. 4) GBLUP-F: a genome-based BLUP with a G a matrix estimated from all markers, except for the marker with the greatest significance (smallest p-value) which instead was included as a fixed regression effect. This single marker was selected from GWAS results based on each training population.
ASReml-R v4.0.5 was used to perform cross-validations for the above four models. The Wald-F statistics was used to test the significance of the fixed regression/covariate effect. In addition to the four approaches mentioned above, we also performed several extra strategies.

Relationship between training and validation sets
Compared with the previously random cross-validation strategy, another validation method entailed the prediction of genomic breeding values of all progenies within a specific full-sib family based on model training using all the other families using GBLUP. In all, there were 32 such across-family cross-validation repeats and the relationships between training and validation sets were thus consistently weak (unrelated or half-sib). The PA within the validated family therefore largely depends on whether the model could capture the Mendelian segregation effects.

Family size (i.e. number of clones per family)
To test if an increase in family size could improve the PA of PBLUP-C and GBLUP, we randomly selected five to 30 clones per family as a training set for the largest ten families with 48-56 clones per family (Fig. S4), using the remaining clones in the ten families as a validation set. This evaluation was performed both for regular PA and within-family PA and cross-validation was performed based on the corrected pedigree where the number of clones per family varied from 32 to 56 (Table S5).

Simulations of large-effect SNPs and their inclusion in the genomic prediction model
To verify whether the inclusion of a major-effect locus as a fixed effect in the model would improve PA, we conducted finite-locus model simulations of a simplified breeding population undergoing one generation of directional selection. We used the software Metagene [63] to simulate a genomic architecture of 15,000 biallelic loci distributed along 12 chromosomes roughly like that of Norway spruce (see Supplementary methods S4 for details). The simulated breeding population comprised the crossing of 100 founders according to a single-pair-mating design producing 50 families and 2000 offspring individuals in the F 1 -testing population. The heritability of the studied virtual trait was kept at 0.25 and an additional "majorgene locus" was introduced into simulations where the PVE of this locus was set throughout the range 0% (false-positive QTL), 0.25% (minor effect), 1.25%, 2.5% and 5% (major true QTL). GP-models (according to Eq. 1) were trained and predictions were cross-validated according to the methods described in previous subsections. One set of models where the presumed major-effect locus was included as a fixed regression effect (GBLUP-F) was compared to a corresponding set of models were the major-effect locus was not included (GBLUP).